Abstract

This paper establishes a statistical versus computational trade-off for solving a basic
high-dimensional machine learning problem via a basic convex relaxation method. Specifically, we
consider the Sparse Principal Component Analysis (Sparse PCA) problem, and the family of
Sum-of-Squares (SoS, aka Lasserre/Parrilo) convex relaxations. It was well known that in large
dimension p, a planted k-sparse unit vector can in principle be detected using only $n \approx k \log p$
(Gaussian or Bernoulli) samples, but all known efficient (polynomial time) algorithms require
$n \approx k^2$ samples. It was also known that this quadratic gap cannot be improved by the most
basic semi-definite (SDP, aka spectral) relaxation, equivalent to a degree-2 SoS algorithm. Here
we prove that degree-4 SoS algorithms also cannot improve this quadratic gap. This average-case
lower bound adds to the small collection of hardness results in machine learning for this
powerful family of convex relaxation algorithms. Moreover, our design of moments (or
“pseudo-expectations”) for this lower bound is quite different from previous lower bounds.
Establishing lower bounds for higher degree SoS algorithms remains a challenging problem.


1 Introduction 

We start with a general discussion of the tension between sample size and computational efficiency in 
statistical and learning problems. We then describe the concrete model and problem at hand:
Sum-of-Squares algorithms and the Sparse PCA problem. All are broad topics studied from different
viewpoints, and the given references provide more information. 

1.1 Statistical vs. computational sample-size 

Modern machine learning and statistical inference problems are often high dimensional, and it is
highly desirable to solve them using far fewer samples than the ambient dimension. Luckily, we often
know, or assume, some underlying structure of the objects sought, which allows such savings in
principle. A typical such assumption is that the number of real degrees of freedom is far smaller
than the dimension; examples include sparsity constraints for vectors, and low rank for matrices
and tensors. The main difficulty that occurs in nearly all these problems is that while information
theoretically the sought answer is present (with high probability) in a small number of samples,
actually computing (or even approximating) it from these many samples is a computationally hard
problem. It is often expressed as a non-convex optimization program which is NP-hard in the worst
case, and seemingly hard even on random instances.

Given this state of affairs, relaxed formulations of such non-convex programs were proposed,
which can be solved efficiently, but which sometimes seem to require far more samples than
existential bounds provide in order to achieve accurate results. This phenomenon has been coined
the “statistical versus computational trade-off” by Chandrasekaran and Jordan [CJ13], who motivate
and formalize one framework to study it in which efficient algorithms come from the Sum-of-Squares
family of convex relaxations (which we shall presently discuss). They further give a detailed study
of this trade-off for the basic de-noising problem [Joh02, Don95, DJ98] in various settings (some
exhibiting the trade-off and others not). This trade-off was observed in other practical machine
learning problems, in particular for the Sparse PCA problem that will be our focus, by Berthet and
Rigollet [BR13a].

As it turns out, the study of the same phenomenon was proposed even earlier in computational
complexity, primarily from theoretical motivations. Decatur, Goldreich and Ron [DGR97] initiated
the study of “computational sample complexity” to study statistical versus computational
trade-offs in sample size. In their framework efficient algorithms are arbitrary polynomial time
ones, not restricted to any particular structure like convex relaxations. They point out, for
example, that in the distribution-free PAC-learning framework of Vapnik-Chervonenkis and Valiant,
there is often no such trade-off. The reason is that the number of samples is essentially
determined (up to logarithmic factors, which we will mostly ignore here) by the VC-dimension of the
given concept class learned, and moreover, an “Occam algorithm” (computing any consistent
hypothesis) suffices for classification from these many samples. So, in the many cases where
efficiently finding a hypothesis consistent with the data is possible, enough samples to learn are
enough to do so efficiently! That paper also provides examples where this is not the case in PAC
learning, and then turns to an extensive study of possible trade-offs for learning various concept
classes under the uniform distribution. This direction was further developed by Servedio [Ser00].

The fast growth of Big Data research, the variety of problems successfully attacked by various
heuristics, and the attempts to find efficient algorithms with provable guarantees have created a
growing area of interaction between statisticians and machine learning researchers on the one hand,
and optimization researchers and computer scientists on the other. The trade-off between sample
size and computational complexity, which seems to be present for many such problems, reflects a
curious “conflict” between these fields: in the first, more data is good news, as it allows more
accurate inference and prediction, whereas in the second it is bad news, as a larger input size is
a source of increased complexity and inefficiency. More importantly, understanding this phenomenon
can serve as a guide to the design of better algorithms from both the statistical and the
computational viewpoints, especially for problems in which data acquisition itself is costly, and
not just computation. A basic question is thus for which problems such a trade-off is inherent, and
what the limits are of what is achievable by efficient methods.

Establishing a trade-off has two parts. One has to prove an existential, information theoretic
upper bound on the number of samples needed when efficiency is not an issue, and then prove a
computational lower bound on the number of samples for the class of efficient algorithms at hand.
Needless to say, it is desirable that the lower bound hold for as wide a class of algorithms as
possible, and that it match the best known upper bound achieved by algorithms from this class. The
most general framework, the computational complexity framework of [DGR97, Ser00], allows all
polynomial-time algorithms. Here one cannot hope for unconditional lower bounds, and so existing
lower bounds rely on computational assumptions, such as “cryptographic assumptions” (e.g. that
factoring integers has no polynomial time algorithm) or other average case assumptions. For
example, hardness of refuting random 3CNF was used for establishing the sample-computational
trade-off for learning halfspaces [DLS13], and hardness of finding a planted clique in random
graphs was used for the trade-off in sparse PCA [BR13a, GMZ14]. On the other hand, in frameworks
such as [CJ13], where the class of efficient algorithms is more restricted (e.g. a family of convex
relaxations), one can hope to prove unconditional lower bounds, which are called “integrality gaps”
in the optimization and algorithms literature. Our main result is of this nature, adding to the
small number of such lower bounds for machine learning problems.

We now turn to describe and motivate SoS convex relaxation algorithms, and then the Sparse
PCA problem.

1.2 Sum-of-Squares convex relaxations 

Sum-of-Squares algorithms (sometimes called the Lasserre hierarchy) encompass perhaps the
strongest known algorithmic technique for a diverse set of optimization problems. It is a family
of convex relaxations introduced independently around the year 2000 by Lasserre [Las01],
Parrilo [Par00], and in the (equivalent) context of proof systems by Grigoriev [Gri01b]. These
papers followed better and better understanding in real algebraic geometry [Art27, Kri64, Ste74,
Sho87, Sch91, Put93, Nes00] of David Hilbert’s famous 17th problem on certifying the non-negativity
of a polynomial by writing it as a sum of squares (which explains the name of this method). We only
briefly describe this important class of algorithms; far more can be found in the book [Las15] and
the excellent extensive survey [Lau09].

The SoS method provides a principled way of adding constraints to a linear or convex program
in a way that obtains tighter and tighter convex sets containing all solutions of the original
problem. This family of algorithms is parametrized by their degree d (sometimes called the number
of rounds); as d gets larger, the approximation becomes better, but the running time becomes
slower, specifically $n^{O(d)}$. Thus in practice one hopes that a small degree (ideally constant)
would provide a sufficiently good approximation, so that the algorithm would run in polynomial
time. This method extends the standard semi-definite relaxation (SDP, sometimes called spectral),
which is captured already by degree-2 SoS algorithms. Moreover, it is more powerful than two
earlier families of relaxations: the Sherali-Adams [SA90] and Lovász-Schrijver [LS91] hierarchies.

The introduction of these algorithms has made a huge splash in the optimization community,
and numerous applications of them to problems in diverse fields were found that greatly improve
solution quality and time performance over all past methods. For large classes of problems they are
considered the strongest algorithmic technique known. Relevant to us is the very recent growing set
of applications of constant-degree SoS algorithms to machine learning problems, such as [BKS15,
BKS14, BM15]. The survey [BS14] contains some of these exciting developments. Section 2.3
contains some self-contained material about the general framework of SoS algorithms as well.

Given their power, it was natural to consider proving lower bounds on what SoS algorithms can
do. There has been impressive progress on SoS degree lower bounds (via beautiful techniques) for
a variety of combinatorial optimization problems [Gri01a, Gri01b, Sch08, MPW15]. However, for
machine learning problems relatively few such lower bounds (above SDP level) are known [BM15,
WGL15], and they follow via reductions to the above bounds. So it is interesting to enrich the set
of techniques for proving such limits on the power of SoS for ML. The lower bound we prove indeed
seems to follow a different route than previous such proofs.


1.3 Sparse PCA 

Sparse principal component analysis, the version of the classical PCA problem which assumes
that the direction of variance of the data has a sparse structure, is by now a central problem of
high-dimensional statistical analysis. In this paper we focus on the single-spiked covariance model
introduced by Johnstone [Joh01]. One observes n samples from a p-dimensional Gaussian distribution
with covariance

$\Sigma = \lambda vv^T + I$ (1.1)

where (the planted vector) v is assumed to be a unit-norm sparse vector with at most k non-zero
entries, and $\lambda > 0$ represents the strength of the signal. The task is to find (or estimate)
the sparse vector v. More general versions of the problem allow several sparse
directions/components and a general covariance matrix [Ma13, VL13]. Sparse PCA and its variants
have a wide variety of applications ranging from signal processing to biology: see, e.g.,
[ABN+99, ULMl, Che11, JOB10].

The hardness of Sparse PCA, at least in the worst case, can be seen through its connection to
the (NP-hard) Clique problem in graphs. Note that if $\Sigma$ is a {0,1} adjacency matrix of a graph
(with 1’s on the diagonal), then it has a k-sparse eigenvector v with eigenvalue k if and only if
the graph has a k-clique. This connection between these two problems is actually deeper, and will
appear again below, for our real, average case version above.
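
The clique connection can be checked by brute force on a toy graph. The sketch below is only an illustration under stated assumptions: it reads the claim through the quadratic form $v^T \Sigma v$ restricted to flat k-sparse unit vectors $v = 1_S/\sqrt{k}$ (our simplification of the eigenvector statement), and the helper names and the 5-vertex graph are hypothetical. For such v the value is the number of ones in the k x k block of $\Sigma$ indexed by S, divided by k, so it equals k exactly when S is a clique.

```python
from itertools import combinations


def sparse_quadratic_form(adj, support):
    """Value of v^T A v for the flat unit vector v = 1_S / sqrt(|S|)."""
    k = len(support)
    return sum(adj[i][j] for i in support for j in support) / k


def best_k_sparse_value(adj, k):
    """Brute-force maximum of v^T A v over flat k-sparse unit vectors."""
    n = len(adj)
    return max(sparse_quadratic_form(adj, s) for s in combinations(range(n), k))


def adjacency_with_ones_on_diagonal(n, edges):
    """{0,1} adjacency matrix of an undirected graph, with 1's on the diagonal."""
    adj = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1
    return adj


# Hypothetical 5-vertex graph: {0, 1, 2} is a triangle, and 3, 4 hang off it.
with_clique = adjacency_with_ones_on_diagonal(
    5, [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)])
# Same graph with the edge (0, 2) removed, so that no 3-clique remains.
without_clique = adjacency_with_ones_on_diagonal(
    5, [(0, 1), (1, 2), (2, 3), (3, 4)])

print(best_k_sparse_value(with_clique, 3))     # reaches k = 3
print(best_k_sparse_value(without_clique, 3))  # stays strictly below 3
```

The enumeration over all k-subsets is exponential in k, which is consistent with the worst-case hardness discussed above.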

From a theoretical point of view, Sparse PCA is one of the simplest examples where we observe
a gap between the number of samples needed information theoretically and the number of samples
needed for a polynomial time estimator. It has been well understood [VL12, PJ12, BR13b] that
information theoretically, given $n = O(k \log p)$ samples^1, one can estimate v up to constant
error (in euclidean norm), using a non-convex (therefore not polynomial time) optimization
algorithm. On the other hand, all the existing provable polynomial time algorithms [JL09, AW09,
VL13, DM14], which use either diagonal thresholding (for the single spiked model) or semidefinite
programming (for general covariance), first introduced for this problem in [dGJL07], need at least
quadratically many samples to solve the problem, namely $n = O(k^2)$. Moreover, Krauthgamer, Nadler
and Vilenchik [KNV15] and Berthet and Rigollet [BR13b] have shown that for semi-definite programs
(SDP) this bound is tight. Specifically, the natural SDP cannot even solve the detection problem:
to distinguish the data in equation (1.1) above from the null hypothesis in which no sparse vector
is planted, namely the n samples are drawn from the Gaussian distribution with covariance matrix I.

Recall that the natural SDP for this problem (and many others) is just the first level of the
SoS hierarchy, namely degree-2. Given the importance of Sparse PCA, it is an intriguing
question whether one can solve it efficiently with far fewer samples by allowing degree-d SoS
algorithms with larger d. A very interesting conditional negative answer was suggested by Berthet
and Rigollet [BR13b]. They gave an efficient reduction from the Planted Clique^2 problem to Sparse
PCA, which shows in particular that degree-d SoS algorithms for Sparse PCA will imply similar
ones for Planted Clique. Gao, Ma and Zhou [GMZ14] strengthen the result by establishing the
hardness of the Gaussian single-spiked covariance model, which is an interesting subset^3 of the
models considered by [BR13a]. These are useful as nontrivial constant-degree SoS lower bounds for
Planted Clique were recently proved by [MPW15, DM15] (see there for the precise description,
history and motivation for Planted Clique). As [BR13b, GMZ14] argue, strong yet believed bounds,
if true, would imply that the quadratic gap is tight for any constant d. Before the submission of
this paper, the known lower bounds above for Planted Clique were not yet strong enough to yield
any lower bound for Sparse PCA beyond the minimax sample complexity. We also note that the
recent progress [RS15, HKP15] that shows the tight lower bounds for Planted Clique, together with
the reductions of [BR13a, GMZ14], also implies the tight lower bounds for Sparse PCA, as shown in
this paper.

1 We treat $\lambda$ as a constant so that we omit the dependence on it for simplicity throughout the introduction section.

2 An average case version of the Clique problem in which the input is a random graph in which a much larger than
expected clique is planted.

3 Note that lower bounds for special cases are stronger than those for general cases.

1.4 Our contribution 

We give a direct, unconditional lower bound proof for computing Sparse PCA using degree-4 SoS
algorithms, showing that they too require $n = \tilde{\Omega}(k^2)$ samples to solve the detection
problem (Theorem 3.1), which is tight up to polylogarithmic factors when the strength of the signal
$\lambda$ is a constant. Indeed the theorem gives a lower bound for every strength $\lambda$, which
becomes weaker as $\lambda$ gets larger. Our proof proceeds by constructing the necessary
pseudo-moments for the SoS program that achieve too high an objective value (in the jargon of
optimization, we prove an “integrality gap” for these programs). As usual in such proofs, there is
tension between having the pseudo-moments satisfy the constraints of the program and keeping them
positive semidefinite (PSD). Differing from past lower bound proofs, we construct two different PSD
moments, each approximately satisfying one set of constraints in the program and negligible on the
rest. Thus, their sum gives PSD moments which approximately satisfy all constraints. We then
perturb these moments to satisfy the constraints exactly, and show that with high probability over
the random data, this perturbation leaves the moments PSD.

We note several features of our lower bound proof which make the result particularly strong
and general. First, it applies not only to the Gaussian distribution, but also to Bernoulli and
other distributions. Indeed, we give a set of natural (pseudorandomness) conditions on the sampled
data vectors under which the SoS algorithm is “fooled”, and show that these conditions are
satisfied with high probability under many similar distributions (possessing strong concentration
of measure). Next, our lower bound holds even if the hidden sparse vector is discrete, namely its
entries come from the set $\{0, \pm 1/\sqrt{k}\}$. We also extend the lower bound for the detection
problem to apply also to the estimation problem, in the regime when the ambient dimension is linear
in the number of samples, namely n < p < Bn for constant B (see Theorem 3.2).

Organization: Section 2 provides more background on sparse PCA and SoS algorithms. We then
state our main results in Section 3. In Section 4 we design the pseudo-moments and state
their properties, and then in Section 5 we prove our main theorems using these moments. Sections 6
and 7 contain the analysis of the moments. Section 8 lists the tools that we use heavily for
proving concentration inequalities in the analysis. Finally, we conclude with a discussion of
further directions of study in Section 9.

2 Formal description of the model and problem 

Notation: We use $\|\cdot\|$ to denote the euclidean norm of a vector and the spectral norm of a
matrix, $\|\cdot\|_q$ to denote the $\ell_q$-norm of a vector, and $|\cdot|_0$ to denote the number
of nonzero entries of a vector.

We use [m] to denote the set of integers {1, ..., m}.

We write $M \succeq 0$ if M is a positive semidefinite matrix.

$\mathbb{R}_n[x]_d$ is used to denote the set of real polynomials with n variables and degree at
most d. We will drop the subscript n when it is clear from context. We will assume that n, k, p are
all sufficiently large^4, and that n < p.

Throughout this paper, by “with high probability some event happens”, we mean the failure
probability is bounded by $p^{-c}$ for every constant c, as p tends to infinity.

We use the asymptotic notation $\tilde{O}(\cdot)$ and $\tilde{\Omega}(\cdot)$ to hide the
logarithmic dependency (in p). That is, $m \le \tilde{O}(f(n,p,k))$ means that there exist a
universal constant r > 0 (which is less than 3 typically in this paper) and C such that
$m \le C f(n,p,k) \log^r p$, and $m \ge \tilde{\Omega}(f(n,p,k))$ means that there exist
constants r and c such that $m \ge c f(n,p,k) / \log^r p$.

2.1 Sparse PCA estimation and detection problems

We will consider the simplest setting of sparse PCA, which is called the single-spiked covariance
model in the literature [Joh01] (note that restricting to a special case makes our lower bound hold
in all generalizations of this simple model). In this model, the task is to recover a single sparse
vector from noisy samples as follows. The “hidden data” is an unknown k-sparse vector
$v \in \mathbb{R}^p$ with $|v|_0 = k$ and $\|v\| = 1$. To make the task easier (and so the lower
bound stronger), we even assume that v has discrete entries, namely that
$v_i \in \{0, \pm 1/\sqrt{k}\}$ for all $i \in [p]$. We observe n noisy samples
$X^1, \dots, X^n \in \mathbb{R}^p$ that are generated as follows. Each is independently drawn as

$X^j = \sqrt{\lambda}\, g_j v + \xi^j$ (2.1)

from a distribution which generalizes both Gaussian and Bernoulli noise to v. Namely, the $g_j$’s
are i.i.d. real random variables with mean 0 and variance 1, and the $\xi^j$’s are i.i.d. random
vectors which have independent entries with mean zero and variance 1. Therefore under this model,
the covariance of $X^j$ is equal to $\lambda vv^T + I$. Moreover, we assume that $g_j$ and the
entries of $\xi^j$ are sub-gaussian^5 with variance proxy O(1). Given these samples, the estimation
problem is to approximate the unknown sparse vector v.

It is also interesting to consider the sparse component detection problem [BR13b, BR13a],
which is the decision problem of distinguishing from random samples the following two distributions:

$H_0$: data $X^j = \xi^j$ is purely random

$H_v$: data $X^j = \xi^j + \sqrt{\lambda}\, g_j v$ contains a hidden sparse signal with strength $\lambda$.

Rigollet [MR14] observed that a polynomial time algorithm for the estimation version of sparse
PCA with constant error implies an algorithm for the detection problem with twice the number of
samples. Thus, for polynomial time lower bounds, it suffices to consider the detection problem.

We will use X as a shorthand for the p x n matrix $[X^1, \dots, X^n]$. We denote the rows of X as
$X_1^T, \dots, X_p^T$, therefore the $X_i$’s are n-dimensional column vectors. The empirical
covariance matrix is defined as $\hat{\Sigma} = \frac{1}{n} XX^T$.
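
The sampling model (2.1) and the empirical covariance matrix can be simulated directly. The following is a minimal sketch, assuming Gaussian $g_j$ and Gaussian noise entries (one of the distributions the model allows); the function names, the seed and the parameter values are hypothetical choices made for illustration.

```python
import math
import random


def sample_spiked_data(n, p, k, lam, planted=True, seed=0):
    """Draw n samples X^j = sqrt(lam) * g_j * v + xi^j with Gaussian g_j and xi^j.

    Returns (samples, v), where samples is a list of n vectors of length p.
    With planted=False the spike is dropped, which gives the null model H_0.
    """
    rng = random.Random(seed)
    support = rng.sample(range(p), k)
    v = [0.0] * p
    for i in support:
        v[i] = rng.choice([-1.0, 1.0]) / math.sqrt(k)  # entries in {0, +-1/sqrt(k)}
    samples = []
    for _ in range(n):
        g = rng.gauss(0.0, 1.0) if planted else 0.0
        samples.append([math.sqrt(lam) * g * v[i] + rng.gauss(0.0, 1.0)
                        for i in range(p)])
    return samples, v


def empirical_covariance(samples):
    """Sigma_hat = (1/n) * X X^T, where the columns of X are the samples."""
    n, p = len(samples), len(samples[0])
    return [[sum(x[a] * x[b] for x in samples) / n for b in range(p)]
            for a in range(p)]


samples, v = sample_spiked_data(n=2000, p=8, k=2, lam=2.0)
sigma_hat = empirical_covariance(samples)
```

With these parameters the diagonal entries of `sigma_hat` should concentrate around $1 + \lambda v_i^2$, that is, around 2 on the support of v and around 1 elsewhere, in line with the population covariance $\lambda vv^T + I$.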

4 Or we assume that they go to infinity as typically done in statistics.

5 A real random variable X is subgaussian with variance proxy $\sigma^2$ if it has similar tail behavior to a gaussian
distribution with variance $\sigma^2$. More formally, if for any $t \in \mathbb{R}$, $E[\exp(tX)] \le \exp(t^2\sigma^2/2)$.

2.2 Statistically optimal estimator/detector

It is well known that the following non-convex program achieves optimal statistical minimax rate 
for the estimation problem and the optimal sample complexity for the detection problem. Note 
that we scale the variables x up by a factor of $\sqrt{k}$ for simplicity (the hidden vector now
has entries from {0, ±1}).

$\lambda^k_{\max}(\hat{\Sigma}) = \frac{1}{k} \cdot \max \langle \hat{\Sigma}, xx^T \rangle$ (2.2)

subject to $\|x\|_2^2 = k$ (2.3)

$|x|_0 = k$ (2.4)


Proposition 2.1 ([AW09], [BR13b], [VL12], informally stated). The non-convex program (2.2)
statistically optimally solves the sparse PCA problem when $n \ge C k/\lambda^2 \log p$ for some
sufficiently large C. Namely, the following hold with high probability. If X is generated from
$H_v$, then the optimal solution $x_{opt}$ of program (2.2) satisfies
$\|\frac{1}{k} \cdot x_{opt} x_{opt}^T - vv^T\| \le [...]$ and the objective value
$\lambda^k_{\max}(\hat{\Sigma})$ is at least 1 + [...]. On the other hand, if X is generated from
the null hypothesis $H_0$, then $\lambda^k_{\max}(\hat{\Sigma})$ is at most 1 + [...].

Therefore, for the detection problem, one can simply use the test
$\lambda^k_{\max}(\hat{\Sigma}) > 1 + [...]$ to distinguish the case of $H_0$ and $H_v$, with
$n = \tilde{\Omega}(k/\lambda^2)$ samples. However, this test is highly inefficient, as the
best known ways for computing $\lambda^k_{\max}(\hat{\Sigma})$ take exponential time! We now turn
to consider efficient ways of solving this problem.
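
To make the exponential cost concrete, the sketch below evaluates the objective of program (2.2) by exhaustive search. It is an illustration under an explicit assumption: candidates are restricted to vectors with entries in {0, +1, -1} (as in the polynomial program of Section 3), which makes the enumeration finite; without this restriction one would instead take the top eigenvalue of each k x k principal submatrix. The function name and the toy matrix are hypothetical.

```python
from itertools import combinations, product


def k_sparse_max(sigma_hat, k):
    """Brute-force value of (1/k) * max <Sigma_hat, x x^T> over
    x in {0, +1, -1}^p with exactly k non-zero entries."""
    p = len(sigma_hat)
    best = float("-inf")
    for support in combinations(range(p), k):
        # Fixing the first sign to +1 loses nothing: x and -x give the same value.
        for signs in product([1, -1], repeat=k - 1):
            x = dict(zip(support, (1,) + signs))
            value = sum(sigma_hat[i][j] * x[i] * x[j]
                        for i in support for j in support)
            best = max(best, value / k)
    return best


# Toy population covariance lambda * v v^T + I with lambda = 2 and
# v = (1, 1, 0, 0) / sqrt(2), used here in place of an empirical covariance.
sigma = [[2, 1, 0, 0],
         [1, 2, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1]]
print(k_sparse_max(sigma, 2))
```

The search visits $\binom{p}{k} 2^{k-1}$ candidates, so it is only usable for very small p and k.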

2.3 Sum of Squares (Lasserre) Relaxations 

Here we will only briefly introduce the basic ideas of Sum-of-Squares (Lasserre) relaxation that
will be used for this paper. We refer readers to the extensive treatments in [Las15, Lau09, BS14]
for detailed discussions of sum of squares algorithms and proofs and their applications to
algorithm design.

Let $\mathbb{R}[x]_d$ denote the set of all real polynomials of degree at most d with n variables
$x_1, \dots, x_n$. We start by defining the notion of pseudo-moment (sometimes called
pseudo-expectation). The intuition is that these pseudo-moments behave like the actual first d
moments of a real probability distribution.

Definition 2.2 (pseudo-moment). A degree-d pseudo-moment M is a linear operator that maps
$\mathbb{R}[x]_d$ to $\mathbb{R}$ and satisfies M(1) = 1 and $M(p^2(x)) \ge 0$ for all real
polynomials p(x) of degree at most d/2.

For a multi-set $S \subset [n]$, we use $x^S$ to denote the monomial $\prod_{i \in S} x_i$. Since M
is a linear operator, it can be clearly described by all the values of M on the monomials of degree
at most d, that is, all the values of $M(x^S)$ for multi-sets S of size at most d uniquely
determine M. Moreover, the nonnegativity constraint $M(p(x)^2) \ge 0$ is equivalent to the positive
semidefiniteness of the matrix-form (as defined below), and therefore the set of all pseudo-moments
is convex.

Definition 2.3 (matrix-form). For an even integer d and any degree-d pseudo-moment M, we
define the matrix-form of M as the trivial way of viewing all the values of M on monomials as a
matrix: we use mat(M) to denote the matrix that is indexed by multi-subsets S of [n] with size at
most d/2, and $\mathrm{mat}(M)_{S,T} = M(x^S x^T)$.

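
Definitions 2.2 and 2.3 can be made concrete for the moments of an actual distribution, which are always valid pseudo-moments. The sketch below is an illustration only: the helper names and the two-point distribution are hypothetical, and it uses genuine expectations, whereas a general pseudo-moment need not come from any distribution.

```python
from itertools import combinations_with_replacement
from math import prod


def moment(distribution, multiset):
    """M(x^S) for an actual distribution given as (probability, point) pairs."""
    return sum(w * prod(x[i] for i in multiset) for w, x in distribution)


def matrix_form(distribution, n, d):
    """mat(M): rows and columns are indexed by multisets of [n] of size at most d/2."""
    index = [s for size in range(d // 2 + 1)
             for s in combinations_with_replacement(range(n), size)]
    mat = [[moment(distribution, s + t) for t in index] for s in index]
    return index, mat


def quadratic_form(mat, c):
    """c^T mat(M) c, which equals M(p(x)^2) for p(x) = sum_S c_S x^S."""
    m = len(c)
    return sum(c[a] * mat[a][b] * c[b] for a in range(m) for b in range(m))


# Hypothetical distribution: uniform over two points of {0, 1, -1}^3.
dist = [(0.5, (1, -1, 0)), (0.5, (0, 1, 1))]
index, mat = matrix_form(dist, n=3, d=4)
print(len(index), mat[0][0])  # number of index multisets, and M(1)
```

For any coefficient vector c, `quadratic_form(mat, c)` is the expectation of a square and hence non-negative, which is the positive semidefiniteness required of mat(M).
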
Given polynomials p(x) and $q_1(x), \dots, q_m(x)$ of degree at most d, and a polynomial program,

Maximize p(x) (2.5)

Subject to $q_i(x) = 0, \forall i \in [m]$

We can write a sum of squares based relaxation in the following way: Instead of searching over
$x \in \mathbb{R}^n$, we search over all the possible “pseudo-moments” M of a hypothetical
distribution over solutions x, that satisfy the constraints above. The key of the relaxation is to
consider only moments up to degree d. Concretely, we have the following semidefinite program in
roughly $n^d$ variables.

Variables $M(x^S)$ $\forall S : |S| \le d$

Maximize M(p(x)) (2.6)

Subject to $M(q_i(x) x^K) = 0$ $\forall i, K : |K| + \deg(q_i) \le d$

$\mathrm{mat}(M) \succeq 0$

Note that (2.6) is a valid relaxation because for any solution $x^*$ of (2.5), if we define
$M(x^S) = (x^*)^S$, then M satisfies all the constraints and the objective value is $p(x^*)$.
Therefore it is guaranteed that the optimal value of (2.6) is always at least that of (2.5).

Finally, the key point is that this program can be solved efficiently, in polynomial time in its
size, namely in time $n^{O(d)}$. As d grows, the constraints added make the “pseudo-distribution”
defined by the moments closer and closer to an actual distribution, thus providing a tighter
relaxation, at the cost of a larger running time to solve it.
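
The size of the program can be counted exactly. The short sketch below (helper names are hypothetical) uses the standard fact that the number of monomials of degree at most d in n variables is $\binom{n+d}{d}$, which is the “roughly $n^d$” count of variables mentioned above for constant d.

```python
from math import comb


def num_moment_variables(n, d):
    """Number of multisets of [n] of size at most d, i.e. monomials of degree <= d."""
    return comb(n + d, d)


def moment_matrix_side(n, d):
    """Side length of mat(M): number of multisets of [n] of size at most d/2."""
    return comb(n + d // 2, d // 2)


for n in (10, 100, 1000):
    print(n, num_moment_variables(n, 4), moment_matrix_side(n, 4))
```

For d = 4 the moment matrix has side length of order $n^2$, which indicates why even the degree-4 relaxation is costly for large n.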

In the next section we apply this relaxation to the Sparse PCA problem and state our results.


3 Main Results 

To exploit the sum of squares relaxation framework as described in Section 2.3, we first convert
the statistically optimal estimator/detector (2.2) into the “polynomial” program version below.


Maximize $\langle \hat{\Sigma}, xx^T \rangle$ (3.1)

subject to $\|x\|_2^2 = k$ (3.2)

$x_i^3 = x_i, \forall i \in [p]$ (3.3)

$|x|_1 \le k$ (3.4)


Note that the non-convex sparsity constraint (2.4) is replaced by the polynomial constraint (3.3),
which ensures that any solution vector x has entries in {0, ±1}, and so together with the constraint
(3.2) guarantees that it has precisely k non-zero entries, each of absolute value 1. Note that
constraint (3.3) implies other natural constraints that one may add to the program in order to
make it stronger: for example, the upper bound on each entry $x_i$, the lower bound on the non-zero
entries $x_i$, and the constraint $\|x\|_4^4 \ge k$ which has been used as a surrogate for k-sparse
vectors in [BKS14, BKS15]. Note that we have also added an $\ell_1$ sparsity constraint (3.4) (which
can be easily made into a polynomial constraint) as is often used in practice and makes our lower
bound even stronger. Of course, it is formally implied by the other constraints, but not in
low-degree SoS.

Now we are ready to apply the sum-of-squares relaxation scheme described in Section 2.3 to
the polynomial program above. For the degree-4 relaxation we obtain the following semidefinite
program $\mathrm{SoS}_4(\hat{\Sigma})$, which we view as an algorithm for both detection and
estimation problems. Note that the same objective function, with only the three constraints (C1),
(C2), (C6), gives the degree-2 relaxation, which is precisely the standard SDP relaxation of Sparse
PCA studied in [AW09, BR13b, KNV15]. So clearly $\mathrm{SoS}_4(\hat{\Sigma})$ subsumes the SDP
relaxation.

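Algorithm 1 $\mathrm{SoS}_4(\hat{\Sigma})$: Degree-4 Sum of Squares Relaxation

Input: $\hat{\Sigma} = \frac{1}{n} XX^T$ where $X = [X^1, \dots, X^n] \in \mathbb{R}^{p \times n}$

Solve the following semidefinite program and obtain the optimal objective value
$\mathrm{SoS}_4(\hat{\Sigma})$ and maximizer $M^*$.

Variables: M(S), for all multi-sets S of size at most 4.

$\mathrm{SoS}_4(\hat{\Sigma}) = \max \sum_{i,j \in [p]} M(x_i x_j)\, \hat{\Sigma}_{ij}$ (Obj)

subject to $\sum_{i \in [p]} M(x_i^2) = k$ (C1)

$\sum_{i,j \in [p]} |M(x_i x_j)| \le k^2$ (C2)

$M(x_i^3 x_j) = M(x_i x_j), \forall i,j \in [p]$ (C3)

$\sum_{\ell \in [p]} M(x_\ell^2 x_s x_t) = k \cdot M(x_s x_t), \forall s,t \in [p]$ (C4)

$\sum_{i,j,s,t \in [p]} |M(x_i x_j x_s x_t)| \le k^4$ (C5)

$M \succeq 0$ (C6)

Output: 1. For detection problem: output $H_v$ if $\mathrm{SoS}_4(\hat{\Sigma}) > (1 + [...])k$, $H_0$ otherwise

2. For estimation problem: output $M_2^* = (M^*(x_i x_j))_{i,j \in [p]}$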
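
That the program in Algorithm 1 is a relaxation can be checked on a small instance: the moments of a point mass on a feasible x satisfy (C1) through (C5), and (C6) holds automatically because the moment matrix is then rank one. The sketch below is an illustration with hypothetical helper names; it does not solve the semidefinite program, which would require an SDP solver.

```python
from itertools import product


def point_mass_moments(x):
    """M(x_i x_j) and M(x_i x_j x_s x_t) for the distribution concentrated on x."""
    p = len(x)
    m2 = {(i, j): x[i] * x[j] for i, j in product(range(p), repeat=2)}
    m4 = {idx: x[idx[0]] * x[idx[1]] * x[idx[2]] * x[idx[3]]
          for idx in product(range(p), repeat=4)}
    return m2, m4


def satisfies_c1_to_c5(m2, m4, p, k):
    """Check constraints (C1) through (C5) of Algorithm 1 for given moments."""
    rng = range(p)
    c1 = sum(m2[i, i] for i in rng) == k
    c2 = sum(abs(m2[i, j]) for i in rng for j in rng) <= k ** 2
    c3 = all(m4[i, i, i, j] == m2[i, j] for i in rng for j in rng)
    c4 = all(sum(m4[l, l, s, t] for l in rng) == k * m2[s, t]
             for s in rng for t in rng)
    c5 = sum(abs(value) for value in m4.values()) <= k ** 4
    return c1 and c2 and c3 and c4 and c5


x = [1, 0, -1, 0, 1]  # hypothetical solution with k = 3 entries in {+1, -1}
m2, m4 = point_mass_moments(x)
print(satisfies_c1_to_c5(m2, m4, p=5, k=3))
```

For such a point mass, (C2) and (C5) hold with equality, since $\sum_{i,j} |x_i x_j| = |x|_1^2 = k^2$ and $\sum_{i,j,s,t} |x_i x_j x_s x_t| = |x|_1^4 = k^4$.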


Before stating the lower bounds for both detection and estimation in the next two subsections,
we comment on the choices made for the outputs of the algorithm in both, as clearly other choices
can be made that would be interesting to investigate. For detection, we pick the natural threshold
(1 + [...])k from the statistically optimal detection algorithm of Section 2.2. Our lower bound
on the objective under $H_0$ is actually a large constant multiple of $\lambda k$, so we could have
taken a higher threshold. To analyze even higher ones would require analyzing the behavior of
$\mathrm{SoS}_4$ under the (planted) alternative distribution $H_v$. For estimation we output the
maximizer $M_2^*$ of the objective function, and prove that it is not too correlated with the
rank-1 matrix $vv^T$ in the planted distribution $H_v$. This suggests, but does not prove, that the
leading eigenvector of $M_2^*$ (which is a natural estimator for v) is not too correlated with v. We
finally note that Rigollet’s efficient reduction from detection to estimation is not in the SoS
framework, and so our detection lower bound does not automatically imply the one for estimation.

3.1 Lower bounds for detection problem

For the detection problem, we prove that $\mathrm{SoS}_4(\hat{\Sigma})$ gives a large objective
value on the null hypothesis $H_0$.

Theorem 3.1. There exist absolute constants C and r such that for
$1 < \lambda < \min\{k^{1/4}, \sqrt{n}\}$ and any $p > C\lambda n$,
$k > C\lambda^{7/6}\sqrt{n}\log^r p$, the following holds. When the data X is drawn from the null
hypothesis $H_0$, then with high probability ($1 - p^{-10}$), the objective value of the degree-4
sum of squares relaxation $\mathrm{SoS}_4(\hat{\Sigma})$ is at least $10\lambda k$. Consequently,
Algorithm 1 can’t solve the detection problem.

To parse the theorem and to understand its consequence, consider first the case when $\lambda$ is
a constant (which is also arguably the most interesting regime). Then the theorem says that when
we have only $n \ll k^2$ samples, the degree-4 SoS relaxation $\mathrm{SoS}_4$ still overfits
heavily to the randomness of the data X under the null hypothesis $H_0$. Therefore, using
$\mathrm{SoS}_4(\hat{\Sigma}) > (1 + [...])k$ (or even $10\lambda k$) as a threshold will fail with
high probability to distinguish $H_0$ and $H_v$.

We note that for constant $\lambda$ our result is essentially tight in terms of the dependencies
between n, k, p. The condition $p = \Omega(n)$ is necessary, since otherwise when p = o(n), even
without the sum of squares relaxation, the objective value is controlled by (1 + o(1))k since
$\hat{\Sigma}$ has maximum eigenvalue 1 + o(1) in this regime. Furthermore, as mentioned in the
introduction, $k \ge \tilde{\Omega}(\sqrt{n})$ is also necessary (up to poly-logarithmic factors),
since when $n \gg k^2$, a simple diagonal thresholding algorithm works for this simple
single-spike model.
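
The support-selection step behind diagonal thresholding can be sketched in a few lines. This is a simplified illustration rather than the procedure of [JL09]: it assumes k is known and keeps the k coordinates with the largest empirical variances instead of comparing each variance to a threshold. The function name and the toy variances are hypothetical.

```python
def diagonal_thresholding(sigma_hat, k):
    """Return the k coordinates with the largest empirical variances, sorted."""
    p = len(sigma_hat)
    ranked = sorted(range(p), key=lambda i: sigma_hat[i][i], reverse=True)
    return sorted(ranked[:k])


# Toy empirical covariance: only the diagonal matters for this step.
variances = [1.02, 2.10, 0.97, 1.90, 1.05]
sigma_hat = [[variances[i] if i == j else 0.0 for j in range(5)] for i in range(5)]
support = diagonal_thresholding(sigma_hat, k=2)
print(support)
```

Under model (2.1) a coordinate on the support of v has variance $1 + \lambda/k$ while all others have variance 1, so a selection of this kind can only succeed once the sampling fluctuations of the diagonal entries are small compared with $\lambda/k$.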

When $\lambda$ is not considered a constant, the dependence of the lower bound on $\lambda$ is not
optimal, but close. Ideally one could expect that as long as $k \gg \lambda\sqrt{n}$ and
$p > \lambda n$, the objective value on the null hypothesis is at least $\Omega(\lambda k)$.
Tightening the $\lambda^{1/6}$ slack, and possibly extending the range of $\lambda$, are left to
future study.

3.2 Lower bounds for the estimation problem 

For the estimation problem, we prove that the $M_2^*$ output by Algorithm 1 is not too correlated
with the desired rank-1 matrix $vv^T$.

Theorem 3.2. For any constant B there exist absolute constants C and r such that for
$\lambda < B/2$, $Bn > p > 2\lambda n$ and $o(p) > k > C\sqrt{n}\log^r p$, suppose the data X is
drawn from hypothesis $H_v$ (model (2.1)). Then with high probability ($1 - p^{-10}$) over the
randomness of the data, Algorithm 1 will output $M_2^*$ such that
$\|\frac{1}{k} \cdot M_2^* - vv^T\| > 1/5$.

We observe that the result is of the same nature (and arguably near-optimal for the estimation
problem) as what [KNV15] achieve for the degree-2 SoS relaxation. The proof follows simply from
combining our detection lower bound Theorem 3.1 and arguments similar to [KNV15]. Finally we
address a threshold-like behavior of the estimation error. Note that while our Theorem proves that
$n = \tilde\Omega(k^2)$ samples are necessary for efficient algorithms to get even constant estimation error, it
is known [YZ13, Ma13, WLL14] that slightly more samples, $n = \tilde O(k^2)$, can already achieve in
polynomial time a much smaller (and optimal) estimation error, namely $O(\sqrt{(k\log p)/n})$.

4 Design of Pseudo-moments 

We start with a sketch of our approach to the design of the moments $M$ at a very high level,
highlighting aspects of their design which are different than in previous lower bounds. First, there
are some natural choices to make. We define the degree-2 moments $M$ from the input as the
empirical covariance matrix, as was done in the proof of the SDP lower bound. This already gives a
large objective value (see Lemma 4.2). We also define all odd moments (degree 1 and 3) to be 0.
The difficult part is designing the degree-4 moments consistently with the constraints and $M$. We
do this in stages, first approximating the constraints (indeed even $M$ only approximately satisfies,
in a way we will specify in Section 4.1, constraints (C1) and (C2)) and later in Section 4.2 correcting
the moments to satisfy the constraints precisely. Moreover, we separately use different 4-moments
for different constraints and then combine them, as follows. We define two different degree-4
PSD moments $P$ and $Q$ such that (with high probability) $P$ almost satisfies constraints (C3),
(C5) and (C6), and is negligible for constraint (C4) (see Lemma 4.4), whereas $Q$ almost satisfies
constraints (C5), (C4) and (C6), and is negligible for (C3) (Lemma 4.5). Therefore the sum
$P + Q$ will almost satisfy all the constraints (Lemma 4.6), which completes the design of the
approximate moments. Finally we "locally" adjust $P + Q$ so that the resulting moments $\tilde M$ exactly
satisfy all the constraints (Theorem 4.7), and remain PSD with high probability.

All moments will be defined from the data matrix $X$, to which we first apply a simple
preprocessing step: we scale all its rows to have squared norm $n$ (around which they are concentrated).
We abuse notation and call the scaled matrix $X$ as well. Note that when the noise model in the
null hypothesis $H_0$ is Bernoulli, namely the entries of $X$ are chosen as unbiased independent $\pm 1$
variables, the rows are automatically scaled, which motivates our abuse of notation. We suggest
that the reader think of this distribution, even though the proof works for a much wider class of
distributions.

The properties above of our moments will be proved under the assumption that the scaled
matrix $X$ satisfies the "pseudo-randomness" condition below. This set-up allows us to encapsulate
what we really need the data to satisfy, and thus prove our lower bound not only for Gaussian
or Bernoulli noise, but actually for a larger family containing both. Namely, we later prove in
Section 7, via a series of concentration inequalities, that when data is drawn from the null hypothesis
$H_0$, its scaling $X$ satisfies the pseudorandomness condition with very high probability under all
these noise models. Note that this condition is actually a sequence of statements about deviation
from the mean of various polynomials in the data; these will become natural once we define our
moments.

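Condition 4.1 (Pseudorandomness Condition). Our constructions of the moments will only require
the following pseudorandom conditions about the (scaled) data matrix $X$:

$$\|X_i\|^2 = n, \quad \forall i \in [p] \tag{P1}$$
$$|\langle X_i, X_j\rangle| \le \tilde O(\sqrt{n}), \quad \forall i \ne j \tag{P2}$$
$$\Big|\sum_{\ell\in[p]\setminus\{i,j\}} \langle X_i, X_\ell\rangle^3 \langle X_j, X_\ell\rangle\Big| \le \tilde O(n^{1.5}p), \quad \forall i \ne j \tag{P3}$$
$$\Big|\sum_{\ell\in[p]} \langle X_i, X_\ell\rangle\langle X_j, X_\ell\rangle\langle X_s, X_\ell\rangle\langle X_t, X_\ell\rangle\Big| \le \tilde O(n^{2}p^{.5}), \quad \forall \text{ distinct } i,j,s,t \tag{P4}$$
$$\Big|\sum_{i\in[p]} \langle X_i, X_s\rangle\langle X_i, X_t\rangle\Big| \le \tilde O(n^{.5}p), \quad \forall \text{ distinct } s,t \tag{P5}$$
$$\Big|\sum_{i,\ell\in[p]} \langle X_i, X_\ell\rangle^2\langle X_s, X_\ell\rangle\langle X_t, X_\ell\rangle\Big| \le \tilde O(n^{1.5}p^2), \quad \forall \text{ distinct } s,t \tag{P6}$$
$$\|XX^T\|_F^2 \ge (1 - o(1))\,np^2 \tag{P7}$$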
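To get a feel for the condition, the following sketch evaluates (P1), (P2) and (P7) on a random $\pm 1$ matrix. It is only an empirical sanity check on small sizes, not part of the proof; the function names and the returned normalized quantities are choices made here for illustration.

```python
import math
import random

def random_sign_matrix(p, n, rng):
    """p rows, each a vector of n independent unbiased +/-1 entries."""
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(p)]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def check_p1_p2_p7(X):
    """Return (P1 holds exactly, max_{i!=j} |<X_i,X_j>| / sqrt(n), ||XX^T||_F^2 / (n p^2))."""
    p, n = len(X), len(X[0])
    gram = [[inner(X[i], X[j]) for j in range(p)] for i in range(p)]
    p1_holds = all(gram[i][i] == n for i in range(p))
    max_offdiag = max(abs(gram[i][j]) for i in range(p) for j in range(p) if i != j)
    frob_sq = sum(g * g for row in gram for g in row)
    return p1_holds, max_offdiag / math.sqrt(n), frob_sq / (n * p * p)
```

For sign matrices (P1) holds exactly, the second returned value should be a small multiple of $\sqrt{\log p}$, and the third should be at least about 1.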


4.1 Approximate Pseudo-moments 

In this section, we design pseudo-moments $M$ that approximately satisfy all the constraints.
Then in the next subsection we will locally adjust them to obtain moments that exactly satisfy all of the
constraints.

We begin by designing (partial) degree-2 moments that give large objective value, which will
later be used for the degree-4 moments. The design is essentially the same as [KNV15], though we
only work with the null hypothesis for now. For the purpose of this section, we suggest the reader
think of $X$ as having uniform $\{\pm 1\}$ entries for simplicity, though as we will see later, we assume
that $X$ satisfies a certain pseudorandomness condition which holds if $X$ is chosen from a variety of
natural stochastic models (with row normalization). We define $M : \mathbb{R}[x]_2 \to \mathbb{R}$ as follows:

$$M(x_i x_j) \triangleq \frac{\gamma k n}{p^2}\hat\Sigma_{ij} = \frac{\gamma k}{p^2}\langle X_i, X_j\rangle, \quad \forall i,j \in [p] \tag{4.1}$$
$$M(x_i) = 0, \quad \forall i \in [p]$$
$$M(1) = 1$$

where $\gamma$ is a constant that is to be tuned later according to the signal strength $\lambda$. Note that by
design $\mathrm{mat}(M)$ is a PSD matrix. We can check straightforwardly that $M$ satisfies constraint (C2)
and gives a large objective value (Obj).


Lemma 4.2. There exists a constant $C$ such that for $p \ge \gamma n$ and $k \ge C\gamma\sqrt{n}\log p$, suppose $X$ satisfies
Condition 4.1; then $M$ is a valid degree-2 pseudo-moment and satisfies the sparsity constraint (C2),

$$\sum_{i,j\in[p]} |M(x_i x_j)| \le k^2/2 \tag{4.2}$$

and has objective value

$$\sum_{i,j\in[p]} M(x_i x_j)\hat\Sigma_{ij} \ge (1 - o(1))\gamma k.$$

Moreover, we also have $M(x_i^2) = \frac{\gamma k n}{p^2} \le \frac{k}{p}$, and $|M(x_i x_j)| \le \tilde O\big(\frac{\gamma k\sqrt{n}}{p^2}\big)$ for $i \ne j$.

Proof. The proof follows from simple calculation and concentration inequalities. Since $\|X_i\|^2 = n$ for all $i$
and, by property (P2), $|\langle X_i, X_j\rangle| \le \tilde O(\sqrt{n})$ for all $i \ne j$, we obtain
that $M(x_i^2) = \frac{\gamma k n}{p^2} \le \frac{k}{p}$ (using $p \ge \gamma n$) and $|M(x_i x_j)| \le \tilde O\big(\frac{\gamma k\sqrt{n}}{p^2}\big)$. Then to verify equation (4.2), we have

$$\sum_{i,j\in[p]} |M(x_i x_j)| \le k + \tilde O(\gamma k\sqrt{n}) \le k^2/2$$

when $k \gg \gamma\sqrt{n}$. Finally, we can verify that the objective value is large:

$$\sum_{i,j\in[p]} M(x_i x_j)\hat\Sigma_{ij} = \frac{\gamma k}{p^2 n}\sum_{i,j\in[p]} \langle X_i, X_j\rangle^2 = \frac{\gamma k}{p^2 n}\|XX^T\|_F^2 \ge (1 - o(1))\gamma k$$

where we use the fact that $\|XX^T\|_F^2 \ge (1 - o(1))p^2 n$ (see property (P7) in Condition 4.1). □

Note that $M$ doesn't satisfy constraint (C1) exactly. However, we could simply fix this by
defining $M'(x_i x_j) = M(x_i x_j)$ for all $i \ne j$ and $M'(x_i^2) = k/p$. However, note that we will use
a perturbation of $M'$ in our final design in Section 4.2 so that it is consistent with the degree-4
moments.

Corollary 4.3. There exists an absolute constant $C$ such that for $p \ge \gamma n$ and $k \ge C\gamma\sqrt{n}\log p$, there
exists a degree-2 pseudo-moment $M'$ that satisfies constraints (C1), (C2) and gives objective value
at least $(1 - o(1))\gamma k$.
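The degree-2 design can be checked numerically on a small sign matrix. The sketch below assumes the scaling $M(x_i x_j) = \frac{\gamma k}{p^2}\langle X_i, X_j\rangle$ and $\hat\Sigma = \frac{1}{n}XX^T$ as in (4.1); the helper names are illustrative only.

```python
import random

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def degree2_moments(X, gamma, k):
    """M(x_i x_j) = gamma * k / p^2 * <X_i, X_j> for a p x n matrix X with rows of squared norm n."""
    p = len(X)
    scale = gamma * k / (p * p)
    return [[scale * inner(X[i], X[j]) for j in range(p)] for i in range(p)]

def objective_value(M, X):
    """sum_{i,j} M(x_i x_j) * Sigma_hat_ij with Sigma_hat = X X^T / n."""
    p, n = len(X), len(X[0])
    return sum(M[i][j] * inner(X[i], X[j]) / n for i in range(p) for j in range(p))
```

On a $\pm 1$ matrix with $p = 1.1\gamma n$, the diagonal moments equal $\gamma k n/p^2 \le k/p$ and the objective comes out at least about $\gamma k$, in line with Lemma 4.2.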

Now we define a degree-4 pseudo-moment that approximately satisfies all the constraints in
$\mathrm{SoS}_4(\hat\Sigma)$ and gives a large objective value. We keep the current (approximate) design $M$ for the degree-2
moments, since the degree-2 moments defined in the previous section seem to be nearly optimal and
enjoy many good properties. Then we define $M(S) = 0$ for any multi-set $S$ of size 3, because
apparently degree-3 moments don't play any role in the semidefinite relaxation.

The main difficulty is to define $M(S)$ for $S$ of size 4. Here we have three constraints (C3),
(C4), and (C5), and the PSDness constraint, that implicitly compete with each other. We took
the following approach. We let $M$ be a sum of two matrices $P$ and $Q$. We ensure that $P$
"almost" (as will be specified later) satisfies (C3) and (C5), and is negligible for constraint (C4). In
turn $Q$ is negligible for constraint (C3) and "almost" satisfies constraints (C4) and (C5). Therefore
$P + Q$ will "almost" satisfy constraints (C3) and (C4), and satisfy the sparsity constraint (C5).
Moreover, $P$ and $Q$ will be PSD by definition. Concretely, we define

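$$M(x_i x_j x_s x_t) = P(i,j,s,t) + Q(i,j,s,t)$$

where $P$ and $Q$ are defined as

$$P(i,j,s,t) = \frac{\gamma k}{p^2 n^3}\sum_{\ell\in[p]} \langle X_i, X_\ell\rangle\langle X_j, X_\ell\rangle\langle X_s, X_\ell\rangle\langle X_t, X_\ell\rangle$$

$$Q(i,j,s,t) = \frac{\gamma k^2}{p^3 n}\big(\langle X_i, X_j\rangle\langle X_s, X_t\rangle + \langle X_i, X_t\rangle\langle X_s, X_j\rangle + \langle X_j, X_t\rangle\langle X_i, X_s\rangle\big)$$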
We note that $P$ and $Q$ are well defined pseudo-moments because they are invariant to the
permutation of indices and naturally PSD. To see the PSDness, note that $P$ is a sum of $p$ rank-1
PSD matrices. Moreover, $Q$ is also PSD: the part that corresponds to $\langle X_s, X_t\rangle\langle X_i, X_j\rangle$ is simply
a rank-1 PSD matrix; $\langle X_i, X_t\rangle\langle X_s, X_j\rangle$ can be written as $\langle X_i \otimes X_s, X_t \otimes X_j\rangle$ and therefore it also
contributes a PSD matrix to $Q$. Similarly, $\langle X_j, X_t\rangle\langle X_s, X_i\rangle$ can be written as $\langle X_s \otimes X_t, X_i \otimes X_j\rangle$,
and it also contributes a PSD matrix.
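As a sanity check of the two structural claims about $P$, the sketch below builds $P$ and $Q$ on a tiny sign matrix, then tests permutation invariance and the non-negativity of the quadratic form of $P$, which equals $\frac{\gamma k}{p^2n^3}\sum_\ell \big(\sum_{i,j} z_{ij}\langle X_i,X_\ell\rangle\langle X_j,X_\ell\rangle\big)^2$. The normalizing constants $\frac{\gamma k}{p^2 n^3}$ and $\frac{\gamma k^2}{p^3 n}$ are assumptions read off from the calculations in Section 6; the function names are illustrative.

```python
import itertools
import random

def gram_matrix(X):
    p = len(X)
    return [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(p)] for i in range(p)]

def make_P_Q(X, gamma, k):
    """Return callables P(i,j,s,t) and Q(i,j,s,t) built from the Gram matrix of the rows of X."""
    p, n = len(X), len(X[0])
    G = gram_matrix(X)
    c_p = gamma * k / (p ** 2 * n ** 3)   # assumed normalization of P
    c_q = gamma * k ** 2 / (p ** 3 * n)   # assumed normalization of Q

    def P(i, j, s, t):
        return c_p * sum(G[i][l] * G[j][l] * G[s][l] * G[t][l] for l in range(p))

    def Q(i, j, s, t):
        return c_q * (G[i][j] * G[s][t] + G[i][t] * G[s][j] + G[j][t] * G[i][s])

    return P, Q

def quadratic_form_P(P, z):
    """z^T mat(P) z for a p x p array z of coefficients z[i][j]."""
    p = len(z)
    idx = range(p)
    return sum(z[i][j] * z[s][t] * P(i, j, s, t)
               for i in idx for j in idx for s in idx for t in idx)
```

Both $P$ and $Q$ take the same value on every reordering of $(i,j,s,t)$, so they are functions of the monomial $x_i x_j x_s x_t$ only.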

In the next two lemmas (one for $P$ and one for $Q$), we formalize the intuition above by showing
that the deviation from $P$ and $Q$ exactly satisfying the constraints is captured by error matrices
$\mathcal{E}, \mathcal{F}, \mathcal{G}$. We bound the magnitude of these error matrices and establish the PSDness of some of
them so that later we can fix them for the exact satisfaction of the constraints.

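Lemma 4.4. There exist absolute constants $C$ and $r$ such that for $1 \le \gamma \le \min\{k^{1/4}, \sqrt{n}\}$,
$1 \le \gamma \le n$, $p = 1.1\gamma n$, and $k \ge C \cdot \gamma^{7/6}\sqrt{n}\log^r p$, suppose $X$ satisfies pseudorandomness
Condition 4.1; then $P$ almost satisfies constraints (C3) and (C5), naturally satisfies the PSD constraint (C6),
and is negligible for constraint (C4), in the sense that

$$P(x_i^3 x_j) = M(x_i x_j) + \mathcal{F}_{ij}, \quad \forall i,j \in [p] \tag{4.3}$$
$$\sum_{i\in[p]} P(x_i^2 x_s x_t) = \mathcal{E}_{st}, \quad \forall s,t \in [p] \tag{4.4}$$
$$\sum_{i,j,s,t} |P(x_i x_j x_s x_t)| \le k^4/3 \tag{4.5}$$

where $\mathcal{F}$ and $\mathcal{E}$ are $p \times p$ matrices that satisfy

1. $0 \le \mathcal{F}_{ii} \le \tilde O\big(\frac{\gamma k}{pn}\big)$, and $|\mathcal{F}_{ij}| \le \tilde O\big(\frac{\gamma k}{pn^{1.5}}\big)$ for any $i$ and $j \ne i$.

2. $\mathcal{E}$ is PSD with $|\mathcal{E}_{ss}| \le \tilde O\big(\frac{\gamma k}{n}\big)$, and $|\mathcal{E}_{st}| \le \tilde O\big(\frac{\gamma k}{n\sqrt{n}}\big)$ for any $s \ne t$.

Lemma 4.5. There exist absolute constants $C$ and $r$ such that for $1 \le \gamma \le n$, $p = 1.1\gamma n$
and $k \ge C \cdot \gamma\sqrt{n}\log^r p$, suppose $X$ satisfies pseudorandomness Condition 4.1; then $Q$ is negligible
for constraint (C3) and almost satisfies constraints (C4) and (C5), in the sense that

$$Q(x_i^3 x_j) = \frac{3k}{p} M(x_i x_j), \quad \forall i,j \tag{4.6}$$
$$\sum_{i\in[p]} Q(x_i^2 x_s x_t) = kM(x_s x_t) + \mathcal{G}_{st}, \quad \forall s,t \tag{4.7}$$
$$\sum_{i,j,s,t} |Q(x_i x_j x_s x_t)| \le k^4/3 \tag{4.8}$$

where $\mathcal{G}$ is a $p \times p$ PSD matrix with $|\mathcal{G}_{ss}| \le \tilde O\big(\frac{\gamma k^2}{p^2}\big)$ and $|\mathcal{G}_{st}| \le \tilde O\big(\frac{\gamma k^2}{p^2\sqrt{n}}\big)$ for any $s \ne t$.

Now we are ready to prove that $M = P + Q$ satisfies all the constraints (C1)-(C6)
approximately.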
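Lemma 4.6. Define $M(x_i x_j x_s x_t) = P(x_i x_j x_s x_t) + Q(x_i x_j x_s x_t)$ for all $i,j,s,t \in [p]$; then we have
under the condition of Lemma 4.4

$$M(x_i^3 x_j) = M(x_i x_j) + \mathcal{F}'_{ij}, \quad \forall i,j \tag{4.9}$$
$$\sum_{i\in[p]} M(x_i^2 x_s x_t) = kM(x_s x_t) + \mathcal{E}'_{st}, \quad \forall s,t \tag{4.10}$$
$$\sum_{i,j,s,t} |M(x_i x_j x_s x_t)| \le 2k^4/3 \tag{4.11}$$

where $\mathcal{F}'$ and $\mathcal{E}'$ are $p \times p$ matrices that satisfy

1. $|\mathcal{F}'_{ii}| \le \tilde O\big(\frac{\gamma k}{pn} + \frac{\gamma k^2 n}{p^3}\big)$ and $|\mathcal{F}'_{ij}| \le \tilde O\big(\frac{\gamma k}{pn^{1.5}} + \frac{\gamma k^2\sqrt{n}}{p^3}\big)$ for all $i \ne j$.

2. $\mathcal{E}'$ is a PSD matrix with $\mathcal{E}'_{ss} \le \tilde O\big(\frac{\gamma k}{n}\big)$ and $|\mathcal{E}'_{st}| \le \tilde O\big(\frac{\gamma k}{n\sqrt{n}}\big)$ for $s \ne t$.

Proof of Lemma 4.6 using Lemma 4.4 and Lemma 4.5. Note that by the definition of $M$, Lemma 4.4
and Lemma 4.5, we have $\mathcal{F}'_{ij} = \mathcal{F}_{ij} + \frac{3k}{p}M(x_i x_j)$ and $\mathcal{E}' = \mathcal{E} + \mathcal{G}$. The bound for $\mathcal{F}'$ follows from the
bound for $\mathcal{F}$ and the facts that $M(x_i^2) = \frac{\gamma k n}{p^2}$ and $|M(x_i x_j)| \le \tilde O\big(\frac{\gamma k\sqrt{n}}{p^2}\big)$. The PSDness of $\mathcal{E}'$
and the bounds for it follow straightforwardly from those of $\mathcal{E}$ and $\mathcal{G}$. Equation (4.11) follows from
equation (4.5) and equation (4.8). □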
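The two identities used in this proof can be written out explicitly. The following is only a restatement obtained by adding the statements of Lemma 4.4 and Lemma 4.5 term by term, with $\mathcal{E}, \mathcal{F}, \mathcal{G}$ denoting the error matrices of those lemmas:

```latex
\begin{align*}
M(x_i^3 x_j) &= P(x_i^3 x_j) + Q(x_i^3 x_j)
  = \big(M(x_i x_j) + \mathcal{F}_{ij}\big) + \tfrac{3k}{p}\, M(x_i x_j)
  = M(x_i x_j) + \underbrace{\mathcal{F}_{ij} + \tfrac{3k}{p}\, M(x_i x_j)}_{\mathcal{F}'_{ij}}, \\
\sum_{i \in [p]} M(x_i^2 x_s x_t) &= \sum_{i \in [p]} P(x_i^2 x_s x_t) + \sum_{i \in [p]} Q(x_i^2 x_s x_t)
  = \mathcal{E}_{st} + k M(x_s x_t) + \mathcal{G}_{st}
  = k M(x_s x_t) + \underbrace{\mathcal{E}_{st} + \mathcal{G}_{st}}_{\mathcal{E}'_{st}}.
\end{align*}
```

Since $\mathcal{E}$ and $\mathcal{G}$ are both PSD, so is their sum $\mathcal{E}'$.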


4.2 Exact Pseudo-moments 

Note that $M$ only satisfies the constraints approximately up to some additive errors (which are
carefully bounded for the purpose of the next theorem). We fix this issue by defining the actual
pseudo-moments $\tilde M$ based on a (carefully chosen) local adjustment of $M$. Concretely, we define
$\tilde M(1) = 1$ and, for every odd-degree monomial $x^\alpha$, $\tilde M(x^\alpha) = 0$. For distinct $i,j,s,t$, we define
$\tilde M(x_i x_j x_s x_t) = M(x_i x_j x_s x_t)$ and $\tilde M(x_i^2 x_s x_t) = M(x_i^2 x_s x_t)$. For distinct $s,t$, we define

$$\tilde M(x_s^3 x_t) = \tilde M(x_s x_t) \triangleq M(x_s x_t) + \frac{1}{k-2}\big(\mathcal{E}'_{st} - 2\mathcal{F}'_{st}\big) \tag{4.12}$$

and $\tilde M(x_s^2 x_t^2) = M(x_s^2 x_t^2) + \delta$, where $\delta$ is a constant (which will be proved to be nonnegative) such that

$$\sum_{s \ne t} \tilde M(x_s^2 x_t^2) = \sum_{s \ne t} \big(M(x_s^2 x_t^2) + \delta\big) = k^2 - k \tag{4.13}$$

Then we define

$$\tilde M(x_i^4) = \tilde M(x_i^2) \triangleq \frac{1}{k-1}\sum_{j \ne i} \tilde M(x_j^2 x_i^2) \tag{4.14}$$

Therefore we can see that, by construction, it is almost obvious that $\tilde M$ satisfies all the linear
constraints (C1), (C3), (C4) exactly. Moreover, since $\mathcal{E}'$ and $\mathcal{F}'$ are small error matrices, most of
the entries $\tilde M(x_i x_j x_s x_t)$ are equal or close to $M(x_i x_j x_s x_t)$. Note that $M$ satisfies the rest of the
constraints (C2), (C5) and (C6) (even with some slackness). We will prove that the difference between
$\tilde M$ and $M$ is small enough so that these constraints are still satisfied by $\tilde M$.
Theorem 4.7. Under the condition of Lemma 4.4, suppose $X$ satisfies pseudorandomness
Condition 4.1; then the pseudo-moments $\tilde M$ defined above satisfy all the constraints (C1)-(C6) of the
semidefinite program and have objective value larger than $(1 - o(1))\gamma k$.

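Proof. We prove that $\tilde M$ satisfies all the constraints in an order that is most convenient for the
proof, and check the objective value at the end.

• Constraint (C3): This is satisfied by the definition of $\tilde M$.

• Constraint (C4): By the definition, we can see that $\tilde M(x_s^3 x_t)$ is also a perturbation of $M(x_s^3 x_t)$:

$$\tilde M(x_s^3 x_t) = M(x_s x_t) + \frac{1}{k-2}\big(\mathcal{E}'_{st} - 2\mathcal{F}'_{st}\big) = M(x_s^3 x_t) + \frac{1}{k-2}\mathcal{E}'_{st} - \frac{k}{k-2}\mathcal{F}'_{st} \tag{4.15}$$

It follows that for $s \ne t$,

$$\begin{aligned}
\sum_{i\in[p]} \tilde M(x_i^2 x_s x_t)
&= 2\tilde M(x_s x_t) + \sum_{i\in[p]\setminus\{s,t\}} \tilde M(x_i^2 x_s x_t) \\
&= 2M(x_s x_t) + \frac{2}{k-2}\big(\mathcal{E}'_{st} - 2\mathcal{F}'_{st}\big) + \sum_{i\in[p]\setminus\{s,t\}} M(x_i^2 x_s x_t) \\
&= M(x_s^3 x_t) + M(x_s x_t^3) - 2\mathcal{F}'_{st} + \frac{2}{k-2}\big(\mathcal{E}'_{st} - 2\mathcal{F}'_{st}\big) + \sum_{i\in[p]\setminus\{s,t\}} M(x_i^2 x_s x_t) \\
&= kM(x_s x_t) + \mathcal{E}'_{st} - 2\mathcal{F}'_{st} + \frac{2}{k-2}\big(\mathcal{E}'_{st} - 2\mathcal{F}'_{st}\big) \\
&= k\tilde M(x_s x_t)
\end{aligned}$$

where the second equality uses definition (4.12), the third uses equation (4.9), the
fourth uses (4.10), and the last equality uses the definition (4.12) again.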
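The cancellation in the chain above hinges on $1 + \frac{2}{k-2} = \frac{k}{k-2}$. The sketch below replays it with exact rational arithmetic for arbitrary values of $M(x_s x_t)$, $\mathcal{E}'_{st}$ and $\mathcal{F}'_{st}$; it is a sanity check of the algebra only, and the function name and argument names are hypothetical.

```python
from fractions import Fraction

def c4_sides_after_adjustment(k, m_st, e_st, f_st):
    """Return (lhs, rhs) of constraint (C4) for one pair s != t after the local adjustment.

    m_st, e_st, f_st stand for M(x_s x_t), E'_st and F'_st respectively.
    """
    k, m_st, e_st, f_st = (Fraction(v) for v in (k, m_st, e_st, f_st))
    # (4.12): adjusted moment, shared by x_s x_t, x_s^3 x_t and x_s x_t^3.
    m_tilde = m_st + (e_st - 2 * f_st) / (k - 2)
    # (4.9) and (4.10): the sum over i outside {s, t} of the unadjusted M(x_i^2 x_s x_t),
    # i.e. (k M + E') minus the two terms M(x_s^3 x_t) and M(x_s x_t^3), each equal to M + F'.
    rest = (k * m_st + e_st) - 2 * (m_st + f_st)
    lhs = 2 * m_tilde + rest
    rhs = k * m_tilde
    return lhs, rhs
```

Both sides agree exactly for any inputs with $k \ne 2$, which is the content of the off-diagonal case of (C4).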

Moreover, for the case when $s = t$, we have that

$$\sum_{i\in[p]} \tilde M(x_i^2 x_s^2) = \tilde M(x_s^4) + \sum_{i\in[p]\setminus\{s\}} \tilde M(x_i^2 x_s^2) = \tilde M(x_s^2) + (k-1)\tilde M(x_s^2) = k\tilde M(x_s^2)$$

where we used the definition (4.14) of $\tilde M(x_s^4)$ and $\tilde M(x_s^2)$. Therefore we verified that $\tilde M$
satisfies constraint (C4).

• Constraint (C1): Using equations (4.13) and (4.14), we have

$$\sum_{i\in[p]} \tilde M(x_i^2) = \frac{1}{k-1}\sum_{i \ne j} \tilde M(x_i^2 x_j^2) = k$$
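• Constraint (C6):

Next we check the PSDness of the matrix $\mathrm{mat}(\tilde M)$. Note that $\mathrm{mat}(\tilde M)$ is indexed by all the multi-subsets
of $[p]$ of size at most 2, and it consists of 3 blocks, $\mathrm{mat}(\tilde M) = \mathrm{blkdiag}(\tilde M_4, \tilde M_2, \tilde M_0)$,
where

$$\tilde M_4 = \big(\mathrm{mat}(\tilde M)_{S,T}\big)_{|S|=2,|T|=2}, \qquad \tilde M_2 = \big(\mathrm{mat}(\tilde M)_{S,T}\big)_{|S|=1,|T|=1}, \qquad \tilde M_0 = 1$$

Therefore it suffices to check that $\tilde M_0$, $\tilde M_2$ and $\tilde M_4$ are all PSD. $\tilde M_0$ is trivially PSD. We can
write $\tilde M_2$ in the following form

$$\tilde M_2 = \big(\tilde M(x_s x_t)\big)_{s,t\in[p]} = \big(M(x_s x_t)\big)_{s,t\in[p]} + \Delta$$

where $\Delta = \tilde M_2 - \big(M(x_s x_t)\big)_{s,t\in[p]}$. By equation (4.12), we have that $\Delta_{st} = \frac{1}{k-2}\big(\mathcal{E}'_{st} - 2\mathcal{F}'_{st}\big)$
for all $s \ne t$. Moreover, by the definition of $\tilde M(x_s^2)$ and $\tilde M(x_s^2 x_t^2)$, we have that

$$\begin{aligned}
\tilde M(x_s^2) &= \frac{1}{k-1}\sum_{t: t \ne s} \tilde M(x_s^2 x_t^2) = \frac{1}{k-1}\sum_{t: t \ne s} \big(M(x_s^2 x_t^2) + \delta\big) \\
&= \frac{1}{k-1}\big(kM(x_s^2) + \mathcal{E}'_{ss} - M(x_s^4)\big) + \frac{p-1}{k-1}\cdot\delta \\
&= \frac{1}{k-1}\big(kM(x_s^2) + \mathcal{E}'_{ss} - M(x_s^2) - \mathcal{F}'_{ss}\big) + \frac{p-1}{k-1}\cdot\delta \\
&= M(x_s^2) + \frac{1}{k-1}\big(\mathcal{E}'_{ss} - \mathcal{F}'_{ss}\big) + \frac{p-1}{k-1}\cdot\delta
\end{aligned} \tag{4.16}$$

where the second line uses equation (4.10) and the third line uses (4.9), and therefore $\Delta_{ss} = \frac{p-1}{k-1}\cdot\delta
+ \frac{1}{k-1}\big(\mathcal{E}'_{ss} - \mathcal{F}'_{ss}\big)$. We extract the PSD matrix $\frac{1}{k-2}\cdot\mathcal{E}'$ from $\Delta$ and obtain $\Delta' = \Delta - \frac{1}{k-2}\cdot\mathcal{E}'$.
Then by this definition, $\Delta'_{ss} = \frac{p-1}{k-1}\cdot\delta + \frac{1}{k-1}\big(\mathcal{E}'_{ss} - \mathcal{F}'_{ss}\big) - \frac{1}{k-2}\mathcal{E}'_{ss}$, and $\Delta'_{st} = -\frac{2}{k-2}\mathcal{F}'_{st}$ for $s \ne t$. We
use the Gershgorin Circle Theorem to establish the PSDness of $\Delta'$. By the bound on $\mathcal{F}'_{st}$ in Lemma 4.6, we have

$$\sum_{t: t \ne s} |\Delta'_{st}| \le \frac{2}{k-2}\cdot p\cdot \max_{t \ne s}|\mathcal{F}'_{st}| \le o\Big(\frac{k}{p}\Big)$$

where we used the fact that $\sqrt{n}/p = o(1)$, which follows from $p = 1.1\gamma n$ and $\gamma \le \sqrt{n}$.
Using equation (4.16) and constraint (C1) we have that

$$k = \sum_s \tilde M(x_s^2) = \sum_s M(x_s^2) + \sum_s \frac{1}{k-1}\big(\mathcal{E}'_{ss} - \mathcal{F}'_{ss}\big) + \sum_s \frac{p-1}{k-1}\cdot\delta \tag{4.17}$$
$$\le \frac{\gamma k n}{p} + o(k) + \frac{p(p-1)}{k-1}\cdot\delta \tag{4.18}$$

Since $p = 1.1\gamma n$, it follows that $\delta \ge (1 - o(1))\frac{k^2}{11p^2}$. Therefore, because the terms of $\Delta'_{ss}$ involving
$\mathcal{E}'_{ss}$ and $\mathcal{F}'_{ss}$ are of lower order by Lemma 4.6, we obtain that $\Delta'_{ss} \ge (1 - o(1))\frac{k}{11p}$. Therefore
$\Delta'_{ss} \ge \sum_{t \ne s} |\Delta'_{st}|$, and by the Gershgorin Circle Theorem $\Delta'$ is PSD.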
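The Gershgorin argument used here is the standard diagonal-dominance test: a symmetric matrix whose every diagonal entry is at least the sum of the absolute values of the other entries in its row has all eigenvalues non-negative. A minimal sketch of that test follows; the function name is a choice made here, and the test is sufficient but not necessary for PSDness.

```python
def gershgorin_certifies_psd(A):
    """True if A is symmetric and each diagonal entry dominates its row's off-diagonal mass.

    When this returns True, every Gershgorin disc lies in the non-negative half-line,
    so the symmetric matrix A is PSD. A False result does not imply A is not PSD.
    """
    n = len(A)
    for i in range(n):
        for j in range(i + 1, n):
            if A[i][j] != A[j][i]:
                return False
    return all(
        A[i][i] >= sum(abs(A[i][j]) for j in range(n) if j != i)
        for i in range(n)
    )
```

For example, the matrix with rows $(1, 2)$ and $(2, 5)$ is PSD (positive trace and determinant) but is not certified by this test, which is why the proof first subtracts known PSD parts before applying it.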

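Now we examine $\tilde M_4$. We write $\tilde M_4$ as

$$\tilde M_4 = \mathrm{mat}(P) + \mathrm{mat}(Q) + \Gamma$$

where $\Gamma = \tilde M_4 - (\mathrm{mat}(P) + \mathrm{mat}(Q))$. One can observe that the only non-zero entries of $\Gamma$ are of
the form

$$\Gamma_{ii,ii} = \tilde M(x_i^4) - P(x_i^4) - Q(x_i^4) = \tilde M(x_i^4) - M(x_i^4) = \tilde M(x_i^2) - M(x_i^2) - \mathcal{F}'_{ii}
= \frac{p-1}{k-1}\cdot\delta + \frac{1}{k-1}\mathcal{E}'_{ii} - \frac{k}{k-1}\mathcal{F}'_{ii} \tag{4.19}$$

and

$$\forall i \ne j, \quad \Gamma_{ii,jj} = \Gamma_{ij,ij} = \Gamma_{ij,ji} = \tilde M(x_i^2 x_j^2) - P(x_i^2 x_j^2) - Q(x_i^2 x_j^2)
= \tilde M(x_i^2 x_j^2) - M(x_i^2 x_j^2) = \delta \tag{4.20}$$

and

$$\forall i \ne j, \quad \Gamma_{ii,ij} = \Gamma_{ii,ji} = \tilde M(x_i^3 x_j) - P(x_i^3 x_j) - Q(x_i^3 x_j)
= \tilde M(x_i^3 x_j) - M(x_i^3 x_j) = \frac{1}{k-2}\mathcal{E}'_{ij} - \frac{k}{k-2}\mathcal{F}'_{ij} \tag{4.21}$$

where the last equality uses equation (4.15).

Now we are ready to prove the PSDness of $\Gamma$. We further decompose $\Gamma$ as $\Gamma = \Gamma' + \mathrm{blkdiag}(\Lambda', 0)$,
where $\Lambda'$ is the $p \times p$ matrix with $\Lambda' = \delta\mathbf{1}\mathbf{1}^T$ (see footnote 6). Note that $\Lambda'$ is a PSD matrix and therefore
it suffices to prove that $\Gamma' = \Gamma - \mathrm{blkdiag}(\Lambda', 0)$ is a PSD matrix.

Note that $\Gamma'$ has its $ij$-th column the same as its $ji$-th column, and therefore it is of rank at
most $p + p(p-1)/2$. We define $\Gamma''$ to be the $(p + p(p-1)/2)$ by $(p + p(p-1)/2)$ submatrix of $\Gamma'$ that
is indexed by subsets $(i,i)$ for $i \in [p]$ and $(i,j)$ for $i < j$. Therefore it suffices to prove that
$\Gamma''$ is PSD. We prove it using the Gershgorin Circle Theorem.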

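Note that by equation (4.19), we have that $\Gamma''_{ii,ii} = \Gamma_{ii,ii} - \Lambda'_{ii,ii} = \big(\frac{p-1}{k-1} - 1\big)\cdot\delta + \frac{1}{k-1}\mathcal{E}'_{ii} - \frac{k}{k-1}\mathcal{F}'_{ii}$.
Therefore by the lower bound for $\delta$ and Lemma 4.6 we obtain $\Gamma''_{ii,ii} \ge (1 - o(1))\frac{k}{11p}$. Moreover,
$\Gamma''_{ii,ij} = \Gamma_{ii,ij} = \frac{1}{k-2}\mathcal{E}'_{ij} - \frac{k}{k-2}\mathcal{F}'_{ij}$ and therefore

$$|\Gamma''_{ii,ij}| \le \Big|\frac{\mathcal{E}'_{ij}}{k-2}\Big| + \Big|\frac{k}{k-2}\mathcal{F}'_{ij}\Big| \le \tilde O\Big(\frac{\gamma}{n^{1.5}}\Big) + \tilde O\Big(\frac{\gamma k^2\sqrt{n}}{p^3}\Big).$$

Furthermore, for $i < j$, $\Gamma''_{ij,ij} = \Gamma_{ij,ij} = \delta \ge (1 - o(1))\frac{k^2}{11p^2}$. Finally observe that
$\Gamma''_{ii,jj} = 0$ by definition and all other entries of $\Gamma''$ are trivially 0 because the corresponding
entries of $\Gamma$ and $\Lambda'$ vanish. Therefore we are ready to use a variant of the Gershgorin Circle
Theorem (Lemma 8.8) to prove the PSDness of $\Gamma''$. Taking $\alpha = 1/\gamma^2$, we have for any $i$,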

6 $\Lambda'$ is indexed by $ii$, $i = 1, \dots, p$.
a l^stl 22 

s,t:(s,t)^=(i,i),s<t j£[p] j:j>i j-j<i 

< ap ■ O(-Yg) = O ("-") < T'O- 

n \P J 

where we used the fact that k S> 7 y/n and e is a constant. 

Moreover, for any i < j, we have that 


E 


r" 1 < ip" 1 -i- ir" I 

i],st\ — \*-i],n\ ' 


— ^( n 1.5 ) 0 ( „2 ) — ^7,7 


where we used $k \ge \gamma^4$ and $k \gg \gamma\sqrt{n}$. Therefore by Lemma 8.8 we obtain that $\Gamma''$ is PSD.

• Constraint (C2): Using Lemma 4.2 and equation (4.12), we have that

$$\sum_{s,t} |\tilde M(x_s x_t)| \le \sum_s \tilde M(x_s^2) + \sum_{s \ne t} |\tilde M(x_s x_t) - M(x_s x_t)| + \sum_{s \ne t} |M(x_s x_t)|
\le k + p^2\cdot\tilde O\Big(\frac{\gamma}{n^{1.5}}\Big) + k^2/2 \le k^2$$

• Constraint (C5): Finally, we check that $\tilde M$ satisfies the sparsity constraint (C5):

$$\sum_{i,j,s,t} |\tilde M(x_i x_j x_s x_t)| \le \sum_{i,j,s,t} |\Gamma_{ij,st}| + \sum_{i,j,s,t} |M(x_i x_j x_s x_t)| \le k^4$$

where we used (4.11) and the (trivial) facts that $|\Gamma_{ij,st}| \le O(k/p)$ for any $i,j,s,t$ and that there are
at most $O(p^2)$ nonzero entries in $\Gamma$.

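• Objective value (Obj): Note that by constraint (C1) and Lemma 4.2 we have that $\sum_i \tilde M(x_i^2) =
k \ge \sum_i M(x_i^2)$; then

$$\begin{aligned}
\sum_{i,j} \tilde M(x_i x_j)\hat\Sigma_{ij} &\ge \sum_{i,j} M(x_i x_j)\hat\Sigma_{ij} - \sum_{i \ne j} |\tilde M(x_i x_j) - M(x_i x_j)|\,|\hat\Sigma_{ij}| \\
&\ge (1 - o(1))\gamma k - p^2\cdot\tilde O\Big(\frac{\gamma}{n\sqrt{n}}\Big)\cdot\tilde O\Big(\frac{1}{\sqrt{n}}\Big) \\
&\ge (1 - o(1))\gamma k
\end{aligned}$$

where in the second inequality we used Lemma 4.2 and the facts that $|\hat\Sigma_{ij}| = \frac{1}{n}|\langle X_i, X_j\rangle| \le
\tilde O(1/\sqrt{n})$ and $|\mathcal{E}'_{ij}| + |\mathcal{F}'_{ij}| \le \tilde O\big(\frac{\gamma k}{n\sqrt{n}}\big)$, and the last line uses the fact that $\gamma^4 \le k$.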

□ 
5 Proof of Theorem 3.1 and Theorem 3.2


In this section, we prove our main Theorems using the technical results of the previous sections.
Before getting into the proof, we start with the observation that in order to get a lower bound of
objective value $10\lambda k$, it suffices to consider the special case when $p = 10\lambda n$. The reason is that
the objective value of $\mathrm{SoS}_4$ is increasing in $p$ while fixing all other parameters: Suppose $p' < p$
and $\hat\Sigma' \in \mathbb{R}^{p' \times p'}$ is a submatrix of $\hat\Sigma$, and $M^{*\prime} : \mathbb{R}_{p'}[x]_4 \to \mathbb{R}$ is the maximizer of $\mathrm{SoS}_4(\hat\Sigma')$. Then we
can extend $M^{*\prime}$ to $M : \mathbb{R}_p[x]_4 \to \mathbb{R}$ by simply defining $M(x^S) = M^{*\prime}(x^S)$ if $S \subset [p']$ and 0
otherwise. This preserves all the constraints and the objective value. Thus we proved that the objective
value for $\hat\Sigma$ is at least the one for $\hat\Sigma'$. Formally, we have

Proposition 5.1. Fixing $\lambda, k, n$, given a data matrix $X \in \mathbb{R}^{p \times n}$ and any submatrix $Y \in
\mathbb{R}^{p' \times n}$ of $X$ with $p' \le p$, let $\hat\Sigma_X$ and $\hat\Sigma_Y$ be the covariance matrices of $X$ and $Y$; then we have that
$\mathrm{SoS}_4(\hat\Sigma_X) \ge \mathrm{SoS}_4(\hat\Sigma_Y)$.
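The zero-extension behind Proposition 5.1 can be illustrated on degree-2 moments. In the sketch below, pseudo-moments are stored in a dictionary keyed by sorted index tuples; this representation and the function names are assumptions made for illustration, not the paper's notation.

```python
def extend_by_zero(moments_small, p_small):
    """Extend moments on variables 0..p_small-1 to more variables.

    Any monomial touching a variable with index >= p_small gets moment 0;
    all other monomials keep their original value.
    """
    def M(monomial):
        key = tuple(sorted(monomial))
        if all(i < p_small for i in key):
            return moments_small.get(key, 0.0)
        return 0.0
    return M

def objective(M, sigma):
    """sum_{i,j} M(x_i x_j) * sigma[i][j]."""
    p = len(sigma)
    return sum(M((i, j)) * sigma[i][j] for i in range(p) for j in range(p))
```

Because the extended moments vanish outside the original block, the objective computed against the larger covariance matrix equals the one computed against its submatrix.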

Now we are ready to prove our main Theorem 3.1. The idea is very simple: we normalize the
data matrix $X$ so that the resulting matrix $\bar X$ satisfies the pseudorandomness Condition 4.1.
Then we apply Theorem 4.7 and obtain a moment matrix which gives a large objective value with
respect to $\bar X$. Then we argue that the difference between $\bar X$ and $X$ is negligible, so that the same
moment matrix also has a large objective value with respect to $X$.

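Proof of Theorem 3.1. Using the observation above, we take $p = 1.1\gamma n$ with $\gamma = 11\lambda$, and we define
$\bar X$ to be the matrix obtained by normalizing the rows of $X$ to Euclidean norm $\sqrt{n}$. Then by Theorem 7.1 it
satisfies the pseudorandomness Condition 4.1. Let $\hat\Sigma' = \frac{1}{n}\bar X\bar X^T$ be the covariance matrix defined
by $\bar X$. By Theorem 4.7 we have that $\mathrm{SoS}_4(\hat\Sigma') \ge (1 - o(1))\gamma k = (1 - o(1))\cdot 11\lambda k$. Moreover, let $\tilde M$ be
the moment defined in Theorem 4.7 and $\tilde M_2$ its restriction to degree-2 moments, that is, $\tilde M_2 =
(\mathrm{mat}(\tilde M)_{S,T})_{|S|=|T|=1}$. We are going to show that the entry-wise differences between $\hat\Sigma$ and $\hat\Sigma'$ are
small enough so that $\langle\tilde M_2, \hat\Sigma\rangle$ is close to $\langle\tilde M_2, \hat\Sigma'\rangle$.

Note that since $\|X_i\|^2 = n \pm \tilde O(\sqrt{n})$, for any $i \ne j$ we have $\hat\Sigma'_{ij} = \frac{\langle X_i, X_j\rangle}{\|X_i\|\|X_j\|} = \hat\Sigma_{ij} \pm \tilde O\big(\frac{1}{\sqrt{n}}\big)|\hat\Sigma_{ij}| =
\hat\Sigma_{ij} \pm \tilde O\big(\frac{1}{n}\big)$. For $i = j$, we have similarly that $\hat\Sigma'_{ii} = \hat\Sigma_{ii} \pm \tilde O\big(\frac{1}{\sqrt{n}}\big)$. We bound the difference between
$\langle\tilde M_2, \hat\Sigma\rangle$ and $\langle\tilde M_2, \hat\Sigma'\rangle$ by the sum of the entry-wise differences:

$$|\langle\tilde M_2, \hat\Sigma' - \hat\Sigma\rangle| \le \sum_i \tilde M(x_i^2)|\hat\Sigma'_{ii} - \hat\Sigma_{ii}| + \sum_{i \ne j} |\tilde M(x_i x_j)|\,|\hat\Sigma'_{ij} - \hat\Sigma_{ij}|
\le p\cdot O(k/p)\cdot\tilde O\Big(\frac{1}{\sqrt{n}}\Big) + p^2\cdot\tilde O\Big(\frac{\gamma k\sqrt{n}}{p^2}\Big)\cdot\tilde O\Big(\frac{1}{n}\Big) = o(k)$$

Therefore $\langle\tilde M_2, \hat\Sigma\rangle \ge (1 - o(1))\gamma k - o(k) = (1 - o(1))\gamma k$. Therefore the moment $\tilde M$ gives objective
value $(1 - o(1))\gamma k$ for data $\hat\Sigma$, and therefore $\mathrm{SoS}_4(\hat\Sigma) \ge (1 - o(1))\gamma k \ge 10\lambda k$. □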

Then we prove that Theorem 3.1 together with the arguments in [KNV15] implies Theorem 3.2.
The intuition behind it is the following: Suppose $\frac{1}{k}M_2^*$ is very close to $vv^T$; then it is close to rank-1
and its leading eigenvector $\hat v$ is close to $v$. However, since we prove that the objective value is large
(which is true also in the planted vector case), $M_2^*$ needs to be highly correlated with $\hat\Sigma$, which
implies that its leading eigenvector $\hat v$ needs to be correlated with $\hat\Sigma$, which in turn implies that $v$ is
correlated with $\hat\Sigma$. However, it turns out that $v$ is not correlated enough with $\hat\Sigma$, which leads to a
contradiction.
Proof of Theorem 3.2. We first prove that the optimal value of $\mathrm{SoS}_4(\hat\Sigma)$ for hypothesis $H_v$ is also
at least $0.99kp/n$. Suppose $v$ has support $S$ of size $k$. We consider the restriction of the linear operator
$M$ to the subset $T = [p]\setminus S$, denoted by $M_T$. That is, we have that $M_T(x^\alpha) = 0$ for any monomial
$x^\alpha$ that contains a factor $x_i$ with $i \in S$, and otherwise $M_T(x^\alpha) = M(x^\alpha)$. We also consider the
data matrix $X_T$ obtained by restricting to the rows indexed by $T$. Note that since $X_T$ doesn't
contain the signal, and $k \gg \sqrt{n}$, using Theorem 4.7 with $\gamma = (p - k)/(1.01n)$, we have that there
exists a pseudo-moment $M_T^*$ which gives objective value $\ge (1 - o(1))\gamma k \ge 0.99pk/n$ with respect to the
covariance matrix $\hat\Sigma_T = \frac{1}{n}X_T X_T^T$. Note that by Proposition 5.1, $\mathrm{SoS}_4(\hat\Sigma) \ge \mathrm{SoS}_4(\hat\Sigma_T)$ and therefore
we obtain that under hypothesis $H_v$, with high probability, $\mathrm{SoS}_4(\hat\Sigma) \ge 0.99kp/n$.

Now suppose $M^*$ is the maximizer of $\mathrm{SoS}_4(\hat\Sigma)$, and $M_2^* = (M^*(x_i x_j))_{i,j\in[p]}$. For the sake of
contradiction, we assume that $\|\frac{1}{k}M_2^* - vv^T\| < 1/5$. We first show that this implies that $M_2^*$
has an eigenvector $\hat v$ that is close to $v$ and its eigenvalue is large. Indeed we have $\|\frac{1}{k}M_2^*\| \ge
\|vv^T\| - 1/5 = 4/5$. Therefore the top eigenvector $\hat v$ of $\frac{1}{k}M_2^*$ has eigenvalue larger than $4/5$. Then
we can decompose the difference between y ■ M| and vv T into y ■ M* — vv T = | • ( vv T — vv T ) + 
(dm - | vv T ). Note that since (yMf ~ ^vv T ) is a PSD matrix with eigenvalue at most 

1/5, we have || — \vv T ^ || < 2/5 by triangle inequality. Then by triangle inequality 

and our assumption again we obtain that 

1 T\ 11 ,,4 Vy 11 ^ 

- 7 VV > 7 ™ — VV ) — - 

4 J " - "5 v ’ 5 

Therefore we obtain that \\vv T — vv T \\ < 3/5 and therefore \\vv T — vv 1 \\ 2 F < 2\\vv T — vv 1 || 2 < 1. 
It follow that |(u, h)| 2 = 1 — \\\vv T — vv 7 ||^ > 1/2. 

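Next we are going to show that it is impossible for $M_2^*$ to have an eigenvector that is close to
$v$ with a large eigenvalue, and therefore we will get a contradiction. Let $\hat v = \alpha v + \beta s$ where $s$ is
orthogonal to $v$, $\alpha^2 + \beta^2 = 1$, $\alpha \ge \sqrt{1/2}$, and $\beta \le \sqrt{1/2}$. Write $\|x\|_{\hat\Sigma} = \sqrt{x^T\hat\Sigma x}$. Then using the triangle inequality

we have that $\|\hat v\|_{\hat\Sigma} \le \|\alpha v\|_{\hat\Sigma} + \|\beta s\|_{\hat\Sigma} \le \sqrt{O(\lambda)}\,\alpha + \sqrt{\|\hat\Sigma\|}\,\beta$. Proposition 5.3 of [KNV15] implies
that for sufficiently large $C$ and $\lambda \ge 1$, when $p/n \ge C\lambda$, $\|\hat\Sigma\| \le 1.01p/n$. Therefore under our
assumption we have that $\|\hat\Sigma\| \le 1.01p/n$. It follows from $\beta \le \sqrt{1/2}$ that $\|\hat v\|_{\hat\Sigma} \le \sqrt{O(\lambda)}\,\alpha + \beta\sqrt{1.01\,p/n} \le
\sqrt{O(\lambda)} + \sqrt{1.01\,p/(2n)}$. Therefore, we have that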

1 ^1 -1 4 k ~ 4 

0.99 p/n < - • SoS 4 (S) = - • (M 2 , X) = — ■ (M% —— • vv T , S) + -(?>0 T , £) 

< ^tr(M 2 * - ^ • 'Oh T )||S|| 2 + ^||u||| 

< i||S|| + 0(a 2 A) + O(^XpJn) + ^ • p/n 

0 o 

< \ ■ — + 0(y/Xp/n) 

5 n 

where in the third line we used the fact that $\|\hat v\|_{\hat\Sigma} \le \sqrt{O(\lambda)} + \sqrt{1.01\,p/(2n)}$, and in the last line we used
$\|\hat\Sigma\| \le 1.01p/n$. Note that this is a contradiction since we assumed that $p/n \ge C\lambda$ for sufficiently
large $C$.

□ 
6 Analysis of matrices P and Q

In this section we prove Lemma 4.4 and Lemma 4.5. They basically follow from direct calculation and the
pseudorandomness properties of the data matrix $X$ listed in Condition 4.1.

Proof of Lemma 4.4. Note that since $p = 1.1\gamma n$ and $1 \le \gamma \le n$, we have that $O(n^2) \ge p \ge n$. We
verify equations (4.3), (4.4) and (4.5) and the bounds for $\mathcal{F}$ and $\mathcal{E}$ one by one.

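• Equation (4.3):

For the case when $i = j$, we verify $P(x_i^4)$ using properties (P1) and (P2):

$$P(x_i^4) = \frac{\gamma k}{p^2 n^3}\Big(\langle X_i, X_i\rangle^4 + \sum_{\ell\in[p]\setminus i} \langle X_i, X_\ell\rangle^4\Big)
\le \frac{\gamma k}{p^2 n^3}\Big(n^3\langle X_i, X_i\rangle + \tilde O(pn^2)\Big)
= M(x_i^2) + \tilde O\Big(\frac{\gamma k}{pn}\Big)$$

For distinct $i,j$, we have that

$$\begin{aligned}
P(x_i^3 x_j) &= \frac{\gamma k}{p^2 n^3}\Big(\langle X_i, X_i\rangle^3\langle X_i, X_j\rangle + \langle X_i, X_j\rangle^3\langle X_j, X_j\rangle + \sum_{\ell\in[p]\setminus i\cup j} \langle X_i, X_\ell\rangle^3\langle X_j, X_\ell\rangle\Big) \\
&= \frac{\gamma k}{p^2 n^3}\Big(n^3\langle X_i, X_j\rangle \pm \tilde O(n^{2.5}) \pm \tilde O(pn^{1.5})\Big) \\
&= M(x_i x_j) \pm \tilde O\Big(\frac{\gamma k}{p^2 n^{.5}}\Big) \pm \tilde O\Big(\frac{\gamma k}{pn^{1.5}}\Big) \\
&= M(x_i x_j) \pm \tilde O\Big(\frac{\gamma k}{pn^{1.5}}\Big)
\end{aligned}$$

where in the second equality we use equation (P3), and in the last we use $p \ge n$.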
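• Equation (4.5):

Note that for distinct $i,j,s,t$, by equation (P4), we have

$$|P(x_i x_j x_s x_t)| = \frac{\gamma k}{p^2 n^3}\Big|\sum_{\ell\in[p]} \langle X_i, X_\ell\rangle\langle X_j, X_\ell\rangle\langle X_s, X_\ell\rangle\langle X_t, X_\ell\rangle\Big| \le \tilde O\Big(\frac{\gamma k}{p^{1.5}n}\Big)$$

Therefore taking the sum over all distinct $i,j,s,t$ we have

$$\sum_{i,j,s,t \text{ distinct}} |P(x_i x_j x_s x_t)| \le p^4\cdot\tilde O\Big(\frac{\gamma k}{p^{1.5}n}\Big) = \tilde O\Big(\frac{\gamma k p^{2.5}}{n}\Big) \le k^4/4 \tag{6.1}$$

where we used $k \gg \gamma^{7/6}\sqrt{n}$, which implies that $k^3 \gg \gamma p^{2.5}/n$.

By equation (4.2) and equation (4.3), we have that

$$\sum_{i,j} |P(x_i^3 x_j)| \le \sum_{i,j} |M(x_i x_j)| + p^2\cdot\tilde O\Big(\frac{\gamma k}{pn^{1.5}}\Big) + p\cdot\tilde O\Big(\frac{\gamma k}{pn}\Big) \le k^2/2 + \tilde O(\gamma k\sqrt{n}) \le k^2 \tag{6.2}$$

where we used the fact that $p \le n^2$ and $k \gg \gamma\sqrt{n}$.
For distinct $i,s,t$, we have that

$$\begin{aligned}
\Big|\sum_{\ell\in[p]} \langle X_i, X_\ell\rangle^2\langle X_s, X_\ell\rangle\langle X_t, X_\ell\rangle\Big|
&\le \Big|\sum_{\ell\in[p]\setminus\{i,s,t\}} \langle X_i, X_\ell\rangle^2\langle X_s, X_\ell\rangle\langle X_t, X_\ell\rangle\Big| + \big|\langle X_i, X_i\rangle^2\langle X_s, X_i\rangle\langle X_t, X_i\rangle\big| \\
&\quad + \big|\langle X_i, X_s\rangle^2\langle X_s, X_s\rangle\langle X_t, X_s\rangle\big| + \big|\langle X_i, X_t\rangle^2\langle X_s, X_t\rangle\langle X_t, X_t\rangle\big| \\
&= \tilde O(pn^2) + \tilde O(n^3) + \tilde O(n^{2.5}) = \tilde O(pn^2)
\end{aligned}$$

It follows that

$$|P(x_i^2 x_s x_t)| = \frac{\gamma k}{p^2 n^3}\Big|\sum_{\ell\in[p]} \langle X_i, X_\ell\rangle^2\langle X_s, X_\ell\rangle\langle X_t, X_\ell\rangle\Big| \le \tilde O\Big(\frac{\gamma k}{pn}\Big)$$

and therefore,

$$\sum_{i,s,t \text{ distinct}} |P(x_i^2 x_s x_t)| \le p^3\cdot\tilde O\Big(\frac{\gamma k}{pn}\Big) = \tilde O\Big(\frac{\gamma k p^2}{n}\Big) \tag{6.3}$$

Therefore, combining equations (6.1), (6.2), (6.3), we obtain that

$$\sum_{i,j,s,t} |P(x_i x_j x_s x_t)| \le k^2 + \tilde O\Big(\frac{\gamma k p^2}{n}\Big) + k^4/4 \le k^4/3$$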
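The implication used for (6.1) is a matter of matching exponents. Substituting $p = 1.1\gamma n$ (the relation assumed in Lemma 4.4) gives the following short derivation, which ignores the poly-logarithmic factors hidden in $\tilde O$:

```latex
\begin{align*}
\frac{\gamma\, p^{2.5}}{n} &= \frac{\gamma\,(1.1\gamma n)^{2.5}}{n} = 1.1^{2.5}\,\gamma^{3.5}\, n^{1.5}, \\
k^3 \gg \frac{\gamma\, p^{2.5}}{n}
  &\iff k \gg \big(1.1^{2.5}\,\gamma^{3.5}\, n^{1.5}\big)^{1/3} = 1.1^{5/6}\,\gamma^{7/6}\,\sqrt{n}.
\end{align*}
```

This is where the exponent $7/6$ in the requirement $k \ge C\gamma^{7/6}\sqrt{n}\log^r p$ comes from.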


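• Equation (4.4):

Finally it remains to bound $\mathcal{E}$. Note that $\mathcal{E}$ is a sum of submatrices of $P$ and therefore it is
PSD. Moreover,

$$\mathcal{E}_{ss} = \sum_{i\in[p]} P(x_i^2 x_s^2) = \frac{\gamma k}{p^2 n^3}\sum_i\sum_\ell \langle X_i, X_\ell\rangle^2\langle X_s, X_\ell\rangle^2
= \frac{\gamma k}{p^2 n^3}\sum_\ell \langle X_s, X_\ell\rangle^2\sum_i \langle X_i, X_\ell\rangle^2
\le \frac{\gamma k}{p^2 n^3}\cdot\tilde O(p^2 n^2) = \tilde O\Big(\frac{\gamma k}{n}\Big)$$

where the last inequality uses equation (P2). Finally we bound $\mathcal{E}_{st}$ using equation (P6):

$$\mathcal{E}_{st} = \sum_{i\in[p]} P(x_i^2 x_s x_t) = \frac{\gamma k}{p^2 n^3}\sum_i\sum_\ell \langle X_i, X_\ell\rangle^2\langle X_s, X_\ell\rangle\langle X_t, X_\ell\rangle
\le \frac{\gamma k}{p^2 n^3}\cdot\tilde O(p^2 n^{1.5}) = \tilde O\Big(\frac{\gamma k}{n^{1.5}}\Big)$$

□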


Proof of Lemma 4.5. Again we verify equations (4.6), (4.7) and (4.8) in order.
• Equation (4.6): By definition we have that for any $i,j$,

$$Q(x_i^3 x_j) = \frac{\gamma k^2}{p^3 n}\cdot 3n\langle X_i, X_j\rangle = \frac{3\gamma k^2}{p^3}\langle X_i, X_j\rangle = \frac{3k}{p}M(x_i x_j)$$

• Equation (4.8): For the sparsity constraint, we note first that for distinct $i,j,s,t$, using
property (P2), we have

$$|Q(x_i x_j x_s x_t)| \le \frac{\gamma k^2}{p^3 n}\cdot\tilde O(n) = \tilde O\Big(\frac{\gamma k^2}{p^3}\Big)$$

and therefore taking the sum, we have

$$\sum_{i,j,s,t \text{ distinct}} |Q(x_i x_j x_s x_t)| \le \tilde O(\gamma k^2 p) \le k^4/6$$

where we used the fact that $k^2 \gg \gamma^2 n$. We bound the other terms as follows:

For any $i,j,s,t \in [p]$, we have that

$$Q(x_i x_j x_s x_t) \le \frac{\gamma k^2}{p^3 n}\cdot 3n^2 = \frac{3\gamma k^2 n}{p^3} \tag{6.4}$$

There are at most $O(p^3)$ different choices of $i,j,s,t$ such that $i,j,s,t$ are not distinct,
therefore we have

$$\sum_{i,j,s,t \text{ not distinct}} |Q(x_i x_j x_s x_t)| \le \frac{3\gamma k^2 n}{p^3}\cdot O(p^3) \le k^4/6 \tag{6.5}$$

where we used the fact that $k \gg \gamma\sqrt{n}$ and $\gamma \ge 1$.

Combining equations (6.4) and (6.5) with the bound for distinct indices, we obtain that

$$\sum_{i,j,s,t} |Q(x_i x_j x_s x_t)| \le k^4/3$$
i,j,s,t 
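• Equation (4.7): For any $s,t$, we have

$$\sum_{i\in[p]} Q(x_i^2 x_s x_t) = \frac{\gamma k^2}{p^3 n}\sum_{i\in[p]} \big(n\langle X_s, X_t\rangle + 2\langle X_i, X_s\rangle\langle X_i, X_t\rangle\big)
= \frac{\gamma k^2}{p^2}\langle X_s, X_t\rangle + \frac{2\gamma k^2}{p^3 n}\sum_{i\in[p]} \langle X_i, X_s\rangle\langle X_i, X_t\rangle$$

Therefore $\mathcal{G}_{st} = \frac{2\gamma k^2}{p^3 n}\sum_{i\in[p]} \langle X_i, X_s\rangle\langle X_i, X_t\rangle$ forms a PSD matrix. Moreover, when $s \ne t$,
using equation (P5), we have that

$$\sum_{i\in[p]} Q(x_i^2 x_s x_t) = kM(x_s x_t) \pm \frac{2\gamma k^2}{p^3 n}\cdot\tilde O(p\sqrt{n}) = kM(x_s x_t) \pm \tilde O\Big(\frac{\gamma k^2}{p^2\sqrt{n}}\Big)$$

When $s = t$, we have that

$$\sum_{i\in[p]} Q(x_i^2 x_s^2) = kM(x_s^2) \pm \frac{2\gamma k^2}{p^3 n}\cdot\tilde O(pn) = kM(x_s^2) \pm \tilde O\Big(\frac{\gamma k^2}{p^2}\Big)$$

□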
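Identities (4.6) and (4.7) are exact algebraic consequences of $\|X_i\|^2 = n$, so they can be verified numerically. The sketch below does this on a small sign matrix, assuming the normalizations $Q = \frac{\gamma k^2}{p^3 n}(\cdots)$ and $M(x_i x_j) = \frac{\gamma k}{p^2}\langle X_i, X_j\rangle$ used in the calculation above; all function names are illustrative.

```python
import random

def gram_matrix(X):
    p = len(X)
    return [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(p)] for i in range(p)]

def q_moment(G, c_q, i, j, s, t):
    """Q(i,j,s,t): the three pairings of the four indices, scaled by c_q."""
    return c_q * (G[i][j] * G[s][t] + G[i][t] * G[s][j] + G[j][t] * G[i][s])

def check_q_identities(X, gamma, k, s, t):
    """Return the residuals of (4.6) at (s, t) and of (4.7) at (s, t); both should be ~0."""
    p, n = len(X), len(X[0])
    G = gram_matrix(X)
    c_q = gamma * k ** 2 / (p ** 3 * n)
    m_st = gamma * k / p ** 2 * G[s][t]                       # M(x_s x_t)
    g_st = 2 * c_q * sum(G[i][s] * G[i][t] for i in range(p))  # error term G_st
    res_46 = q_moment(G, c_q, s, s, s, t) - 3 * k / p * m_st
    res_47 = sum(q_moment(G, c_q, i, i, s, t) for i in range(p)) - (k * m_st + g_st)
    return res_46, res_47
```

The residuals vanish up to floating-point rounding whenever every row of `X` has squared norm exactly $n$.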


7 Pseudo-randomness of X 


In this section, we prove that, basically, as long as the noise model is subgaussian and has variance
1 (which generalizes the standard Bernoulli and Gaussian distributions), after normalizing the rows
of the data matrix $X \sim H_0$, it satisfies the pseudorandomness Condition 4.1.


Theorem 7.1. Suppose independent random variables $X_1, \dots, X_p \in \mathbb{R}^n$ satisfy that for any $i$,
$X_i$ has i.i.d. entries with mean zero, variance 1, and subgaussian variance proxy (see footnote 7) $O(1)$; then the
matrix $\bar X$ with $\frac{\sqrt{n}\,X_i^T}{\|X_i\|}$ as rows satisfies the pseudorandomness Condition 4.1.

The proof of the Theorem relies on the following Proposition and Theorem 7.4. The proposition
says that $\frac{\sqrt{n}\,X_i}{\|X_i\|}$ still satisfies good properties like symmetry and that each entry has a subgaussian
tail, even though its entries are no longer independent due to normalization. These properties will
be encapsulated in the definition of a good random variable following the proposition. Then we prove
in Theorem 7.4 that these properties suffice for establishing the pseudorandomness Condition 4.1
with high probability. We will heavily use the $\psi_\alpha$-Orlicz norm (denoted $\|\cdot\|_{\psi_\alpha}$) of a random variable,
defined in Definition 8.1, and its properties, summarized in the next (toolbox) section. Intuitively, the
$\|\cdot\|_{\psi_2}$ norm is a succinct and convenient way to capture the "subgaussianity" of a random variable.


¹A real random variable $X$ is subgaussian with variance proxy $\sigma^2$ if it has similar tail behavior as a gaussian distribution with variance $\sigma^2$, and formally if for any $t \in \mathbb{R}$, $\mathbb{E}[\exp(tX)] \le \exp(t^2\sigma^2/2)$.

Proposition 7.2. Suppose $y \in \mathbb{R}^n$ has i.i.d. entries with variance 1 and mean zero, and subgaussian variance proxy $O(1)$, then the random variable $x = \frac{\sqrt{n}\, y}{\|y\|}$ satisfies the following properties:

1. $\|x\|^2 = n$, almost surely.

2. For any vector $a \in \mathbb{R}^n$ with $\|a\|^2 \le 2n$, $\|\langle x, a\rangle\|_{\psi_2}^2 \le O(n)$.

3. $\big\| \|x\|_\infty \big\|_{\psi_2} \le \tilde O(1)$.

4. $\mathbb{E}[x_i^2] = 1$, $\mathbb{E}[x_i^4] = c_4$, and $\mathbb{E}[x_i^2 x_j^2] = c_{2,2}$ for all $i$ and $j \ne i$, where $c_4, c_{2,2} = O(1)$ are constants that don't depend on $i,j$.

5. For any monomial $x^\alpha$ with an odd degree, $\mathbb{E}[x^\alpha] = 0$.

For simplicity, we call a random variable good if it satisfies the five properties listed in the proposition 
above. Goodness will be invoked in most statements below. 
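The moment properties in Proposition 7.2 can be illustrated numerically. The following is a minimal sketch, assuming Gaussian entries for $y$; the function names, the sample size and the dimension are illustrative choices, not part of the paper. For Gaussian $y$ the normalized vector is uniform on the sphere of radius $\sqrt{n}$, for which $c_4 = 3n/(n+2)$ and $c_{2,2} = n/(n+2)$ are known closed forms, so the estimates can be compared against them.

```python
import numpy as np

def normalize_rows(y):
    """Rescale every row of y to Euclidean norm sqrt(n), where n = y.shape[1]."""
    n = y.shape[1]
    return np.sqrt(n) * y / np.linalg.norm(y, axis=1, keepdims=True)

def empirical_goodness_stats(x):
    """Monte Carlo estimates of the moments listed in Proposition 7.2.

    Each row of x is treated as one independent sample of the random vector.
    """
    n = x.shape[1]
    return {
        # property 1: squared norm equals n for every sample
        "max_sq_norm_error": float(np.max(np.abs(np.sum(x ** 2, axis=1) - n))),
        # property 4: E[x_i^2], c_4 = E[x_i^4], c_{2,2} = E[x_i^2 x_j^2]
        "second_moment": float(np.mean(x[:, 0] ** 2)),
        "fourth_moment": float(np.mean(x[:, 0] ** 4)),
        "mixed_moment": float(np.mean(x[:, 0] ** 2 * x[:, 1] ** 2)),
        # property 5: a monomial of odd degree (here degree 3) has mean zero
        "odd_moment": float(np.mean(x[:, 0] * x[:, 1] ** 2)),
    }

rng = np.random.default_rng(0)
n, samples = 10, 200_000          # illustrative sizes (assumption)
stats = empirical_goodness_stats(normalize_rows(rng.standard_normal((samples, n))))
print(stats)
```

The sketch only checks properties 1, 4 and 5; the Orlicz-norm properties 2 and 3 are tail statements and are not estimated here.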


Definition 7.3 (goodness). A random variable $x \in \mathbb{R}^n$ is called good if it satisfies the conclusion of Proposition 7.2.

We will show that a random matrix $X$ with good rows satisfies the pseudorandomness Condition 4.1 with high probability.

Theorem 7.4. Suppose independent $n$-dimensional random vectors $X_1, \dots, X_p$ with $p \ge n$ are all good, then $X_1, \dots, X_p$ satisfy the pseudorandomness condition 4.1 with high probability.

The general approach to proving the theorem is simply to use concentration of measure. The only caveat is that in most cases the random variables we deal with are not bounded almost surely, so we cannot use the Chernoff bound or the Bernstein inequality directly. However, although these random variables are not bounded a.s., they typically have a light tail, that is, their $\psi_\alpha$ norms can be bounded. We then apply Theorem 8.4 of Ledoux and Talagrand, an extended version of the Bernstein inequality that requires only boundedness of the $\psi_\alpha$ norm. We will also use other known technical results listed in the toolbox Section 8.

Proof of Theorem 7.4. Equations (P1) and (P2) follow from the assumptions on the $X_i$'s and a union bound. Equation (P3) is proved in Lemma 7.5 by taking $u = X_s$ and $v = X_t$ and viewing the rest of the $X_i$'s as the $Z_i$'s in the statement of Lemma 7.5. Equation (P4) is proved in Lemma 7.6, (P5) in Lemma 7.8, (P6) in Lemma 7.10, and equation (P7) is proved in Lemma 7.15. □
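As a numerical sanity check (not part of the proof), the sketch below evaluates a few of the quantities controlled by Theorem 7.4 on a random $\pm 1$ matrix and divides each by the scale claimed for it. The function name `pseudorandomness_ratios`, the dictionary keys, and the sizes $p = 400$, $n = 50$ are illustrative assumptions; the matching of keys to (P1), (P2), (P5) and (P7) follows Proposition 7.2, Lemma 7.8 and Lemma 7.15 as stated in this section.

```python
import numpy as np

def pseudorandomness_ratios(X):
    """Ratios of selected pseudorandomness quantities to their claimed scales.

    X has p rows, each of squared norm n. The (P2) and (P5) ratios should stay
    polylogarithmic in p, and the (P7) ratio should be at least 1 - o(1).
    """
    p, n = X.shape
    G = X @ X.T                         # Gram matrix, G[i, j] = <X_i, X_j>
    off = ~np.eye(p, dtype=bool)        # mask selecting entries with i != j
    G2 = G @ G                          # G2[s, t] = sum_i <X_i, X_s><X_i, X_t>
    return {
        "P1_max_error": float(np.max(np.abs(np.diag(G) - n))),
        "P2_ratio": float(np.max(G[off] ** 2) / n),
        "P5_ratio": float(np.max(np.abs(G2[off])) / (p * np.sqrt(n))),
        "P7_ratio": float(np.sum(G ** 2) / (p ** 2 * n)),
    }

rng = np.random.default_rng(1)
p, n = 400, 50                                   # illustrative sizes (assumption)
X = rng.choice([-1.0, 1.0], size=(p, n))         # uniform +-1 rows already have squared norm n
ratios = pseudorandomness_ratios(X)
print(ratios)
```

A single draw at one size cannot confirm an asymptotic statement; repeating the experiment for growing $p$ and $n$ would show whether the ratios stay bounded up to logarithmic factors.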

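Lemma 7.5. For any good random variable $x$, we have that for fixed $u,v$ with $\|u\|^2 = \|v\|^2 = n$, $\|u\|_\infty \le \tilde O(1)$, $\|v\|_\infty \le \tilde O(1)$, and $\langle u,v\rangle \le \tilde O(\sqrt{n})$,

$$\big|\mathbb{E}\big[\langle x,u\rangle^3 \langle x,v\rangle\big]\big| \le \tilde O(n^{1.5})$$

and moreover, for $p \ge n$ and a sequence of good independent random variables $Z_1, \dots, Z_p$, we have that with high probability,

$$\Big|\sum_{i=1}^{p} \langle Z_i,u\rangle^3 \langle Z_i,v\rangle\Big| \le \tilde O(n^{1.5} p)$$

Proof. We calculate the expectation as follows:

$$\mathbb{E}\big[\langle x,u\rangle^3\langle x,v\rangle\big] = \mathbb{E}\Big[\Big(\sum_i u_i^2 x_i^2 + 2\sum_{i<j} u_i u_j x_i x_j\Big)\Big(\sum_i u_i v_i x_i^2 + \sum_{i\ne j} u_i v_j x_i x_j\Big)\Big]$$

$$= \mathbb{E}\Big[\sum_i u_i^3 v_i x_i^4\Big] + \mathbb{E}\Big[\sum_{i\ne j} u_i^2 u_j v_j x_i^2 x_j^2\Big] + \mathbb{E}\Big[\sum_{i\ne j} u_i u_j (u_i v_j + v_i u_j) x_i^2 x_j^2\Big]$$

$$= (c_4 - c_{2,2})\sum_i u_i^3 v_i + c_{2,2}\, n \sum_i u_i v_i + c_{2,2}\sum_{i\ne j} u_i u_j (u_i v_j + v_i u_j)$$

$$= (c_4 - 3c_{2,2})\sum_i u_i^3 v_i + 3c_{2,2}\, n \sum_i u_i v_i$$

Therefore by our assumption on $u$ and $v$ we obtain that

$$\big|\mathbb{E}\big[\langle x,u\rangle^3\langle x,v\rangle\big]\big| \le \tilde O(n) + \tilde O(n)\,|\langle u,v\rangle| \le \tilde O(n^{1.5})$$

Now we prove the second statement. Since $\|\langle Z_i,u\rangle\|_{\psi_2} \le \tilde O(\sqrt{n})$, by Lemma 8.5 we have that $\|\langle Z_i,u\rangle^3\langle Z_i,v\rangle\|_{\psi_{1/2}} \le \tilde O(n^2)$, and it follows from Lemma 8.6 that $\|\langle Z_i,u\rangle^3\langle Z_i,v\rangle - \mathbb{E}[\langle Z_i,u\rangle^3\langle Z_i,v\rangle]\|_{\psi_{1/2}} \le \tilde O(n^2)$. Then by Theorem 8.4 we obtain that with high probability,

$$\Big|\sum_{i=1}^{p} \langle Z_i,u\rangle^3\langle Z_i,v\rangle - \mathbb{E}\Big[\sum_{i=1}^{p} \langle Z_i,u\rangle^3\langle Z_i,v\rangle\Big]\Big| \le \tilde O(n^2\sqrt{p})$$

Note that we have proved that $|\mathbb{E}[\langle Z_i,u\rangle^3\langle Z_i,v\rangle]| \le \tilde O(n^{1.5})$ for each $i$, so the expectation of the sum is at most $\tilde O(n^{1.5}p)$, and $n^2\sqrt{p} \le n^{1.5}p$ since $p \ge n$. Therefore we obtain the desired result.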

□ 
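The fourth-moment computation in the proof of Lemma 7.5 can be checked exactly in the simplest good case. The sketch below is an illustration under stated assumptions: $x$ is taken uniform on $\{\pm 1\}^n$, so that $c_4 = c_{2,2} = 1$; the expectation is computed by brute-force enumeration; and the closed form is written with $\|u\|^2$ in place of $n$ so that $u$ need not be normalized. The function names and the dimension 6 are hypothetical choices.

```python
import itertools
import numpy as np

def exact_moment_rademacher(u, v):
    """E[<x,u>^3 <x,v>] for x uniform on {-1,+1}^n, by enumerating all 2^n sign vectors."""
    n = len(u)
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        x = np.array(signs)
        total += (x @ u) ** 3 * (x @ v)
    return total / 2 ** n

def closed_form_moment(u, v, c4=1.0, c22=1.0):
    """(c4 - 3 c22) sum_i u_i^3 v_i + 3 c22 ||u||^2 <u,v>.

    With ||u||^2 = n this is the last line of the displayed computation;
    c4 = c22 = 1 is the value for uniform +-1 entries.
    """
    return (c4 - 3 * c22) * np.sum(u ** 3 * v) + 3 * c22 * (u @ u) * (u @ v)

rng = np.random.default_rng(2)
u, v = rng.standard_normal(6), rng.standard_normal(6)
gap = abs(exact_moment_rademacher(u, v) - closed_form_moment(u, v))
print(gap)
```

The enumeration costs $2^n$ evaluations, so it is only meant for small $n$.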

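Lemma 7.6. Suppose $p \ge n$ and $X_1, \dots, X_p$ are good independent random variables, then with high probability, for any distinct $i,j,s,t$,

$$\Big|\sum_{\ell\in[p]} \langle X_i,X_\ell\rangle \langle X_j,X_\ell\rangle \langle X_s,X_\ell\rangle \langle X_t,X_\ell\rangle\Big| \le \tilde O(n^2\sqrt{p})$$

Proof. Fixing $i,j,s,t$, we can write

$$\sum_{\ell\in[p]} \langle X_i,X_\ell\rangle \langle X_j,X_\ell\rangle \langle X_s,X_\ell\rangle \langle X_t,X_\ell\rangle = \sum_{\ell\in[p]\setminus\{i,j,s,t\}} \langle X_i,X_\ell\rangle \langle X_j,X_\ell\rangle \langle X_s,X_\ell\rangle \langle X_t,X_\ell\rangle$$
$$+\; n\langle X_j,X_i\rangle\langle X_s,X_i\rangle\langle X_t,X_i\rangle + n\langle X_i,X_j\rangle\langle X_s,X_j\rangle\langle X_t,X_j\rangle$$
$$+\; n\langle X_i,X_s\rangle\langle X_j,X_s\rangle\langle X_t,X_s\rangle + n\langle X_i,X_t\rangle\langle X_j,X_t\rangle\langle X_s,X_t\rangle$$

Using Lemma 7.7, the first term on the RHS is bounded by $\tilde O(n^2\sqrt{p})$ with high probability over the randomness of $X_\ell$, $\ell \in [p]\setminus\{i,j,s,t\}$. The remaining four terms are bounded by $\tilde O(n^{2.5})$. Putting these together, $\big|\sum_{\ell\in[p]} \langle X_i,X_\ell\rangle \langle X_j,X_\ell\rangle \langle X_s,X_\ell\rangle \langle X_t,X_\ell\rangle\big| \le \tilde O(n^2\sqrt{p})$ for any fixed $i,j,s,t$ with high probability, and taking a union bound we get the result. □

Lemma 7.7. For any good random variable $x$, and for fixed $a, b, c, d$ such that $\max\{\|a\|_\infty, \|b\|_\infty, \|c\|_\infty, \|d\|_\infty\} \le \tilde O(1)$, and all the pairwise inner products between $a,b,c,d$ have magnitude at most $\tilde O(\sqrt{n})$, we have that

$$\big|\mathbb{E}\big[\langle x,a\rangle\langle x,b\rangle\langle x,c\rangle\langle x,d\rangle\big]\big| = \tilde O(n)$$

and moreover, for $p \ge n$ and a sequence of independent random variables $Z_1, \dots, Z_p$ such that each $Z_i$ satisfies the conclusion of Proposition 7.2, we have that with high probability,

$$\Big|\sum_{i=1}^{p} \langle Z_i,a\rangle\langle Z_i,b\rangle\langle Z_i,c\rangle\langle Z_i,d\rangle\Big| \le \tilde O(n^2\sqrt{p})$$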


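Proof. We calculate the mean

$$\mathbb{E}\big[\langle x,a\rangle\langle x,b\rangle\langle x,c\rangle\langle x,d\rangle\big] = \mathbb{E}\Big[\sum_{i} a_i b_i c_i d_i x_i^4 + \sum_{\mathrm{sym}} \sum_{i\ne j} a_i b_i c_j d_j x_i^2 x_j^2\Big]$$

where we use $\sum_{\mathrm{sym}} a_i b_i c_j d_j x_i^2 x_j^2$ to denote the sum of $a_i b_i c_j d_j x_i^2 x_j^2$ and all its permutations with respect to $a, b, c, d$.

Note that

$$\Big|\mathbb{E}\Big[\sum_{i\ne j} a_i b_i c_j d_j x_i^2 x_j^2\Big]\Big| = c_{2,2}\,\Big|\langle a,b\rangle\langle c,d\rangle - \sum_{i} a_i b_i c_i d_i\Big| \le \tilde O(n)$$

and

$$\Big|\mathbb{E}\Big[\sum_{i} a_i b_i c_i d_i x_i^4\Big]\Big| = c_4\,\Big|\sum_{i} a_i b_i c_i d_i\Big| \le \tilde O(n)$$

and therefore we have $|\mathbb{E}[\langle x,a\rangle\langle x,b\rangle\langle x,c\rangle\langle x,d\rangle]| \le \tilde O(n)$.

Since $\langle x,a\rangle$ has $\psi_2$ norm $O(\sqrt{n})$, and similarly for the other three terms, we have by Lemma 8.5 that $\|\langle x,a\rangle\langle x,b\rangle\langle x,c\rangle\langle x,d\rangle\|_{\psi_{1/2}} \le \tilde O(n^2)$. Therefore using Theorem 8.4 we have that

$$\Big\|\sum_{i=1}^{p} \langle Z_i,a\rangle\langle Z_i,b\rangle\langle Z_i,c\rangle\langle Z_i,d\rangle - \mathbb{E}\Big[\sum_{i=1}^{p} \langle Z_i,a\rangle\langle Z_i,b\rangle\langle Z_i,c\rangle\langle Z_i,d\rangle\Big]\Big\|_{\psi_{1/2}} \le \tilde O(n^2\sqrt{p})$$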


□ 


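Lemma 7.8. Suppose $p \ge n$ and $X_1, \dots, X_p$ are good independent random variables, then with high probability, for any distinct $s,t$,

$$\Big|\sum_{i\in[p]} \langle X_i,X_s\rangle\langle X_i,X_t\rangle\Big| \le \tilde O(p\sqrt{n})$$

Proof. With high probability over the randomness of $X_i$, $i \in [p]\setminus\{s,t\}$,

$$\sum_{i\in[p]} \langle X_i,X_s\rangle\langle X_i,X_t\rangle = \sum_{i\in[p]\setminus\{s,t\}} \langle X_i,X_s\rangle\langle X_i,X_t\rangle + 2n\langle X_s,X_t\rangle \le \tilde O(p\sqrt{n}) + \tilde O(n^{1.5})$$

where the last inequality is by Lemma 7.9. Taking a union bound we complete the proof. □

Lemma 7.9. For $p \ge n$ and a sequence of good independent random variables $Z_1, \dots, Z_p$, and any two fixed vectors $u, v$ with $\|u\|_\infty \le \tilde O(1)$ and $\|v\|_\infty \le \tilde O(1)$, and $\langle u,v\rangle \le \tilde O(\sqrt{n})$, we have that with high probability,

$$\Big|\sum_{i\in[p]} \langle Z_i,u\rangle\langle Z_i,v\rangle\Big| \le \tilde O(p\sqrt{n})$$

Proof. $\mathbb{E}[\langle Z_i,u\rangle\langle Z_i,v\rangle] = \langle u,v\rangle \le \tilde O(\sqrt{n})$, and therefore $\big|\mathbb{E}\big[\sum_{i\in[p]} \langle Z_i,u\rangle\langle Z_i,v\rangle\big]\big| \le \tilde O(p\sqrt{n})$. Note that $\|\langle Z_i,u\rangle\|_{\psi_2} \le \tilde O(\sqrt{n})$ and therefore $\|\langle Z_i,u\rangle\langle Z_i,v\rangle\|_{\psi_1} \le \tilde O(n)$. By Theorem 8.4 we have the desired result.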

□ 


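Lemma 7.10. Suppose $p \ge n$ and $X_1, \dots, X_p$ are good independent random variables, then with high probability, for any distinct $s, t$,

$$\Big|\sum_{i,\ell\in[p]} \langle X_i,X_\ell\rangle^2 \langle X_s,X_\ell\rangle\langle X_t,X_\ell\rangle\Big| \le \tilde O(p^2 n^{1.5})$$

Proof. We expand the target as follows:

$$\sum_{i,\ell\in[p]} \langle X_i,X_\ell\rangle^2 \langle X_s,X_\ell\rangle\langle X_t,X_\ell\rangle = \sum_{i\in[p],\,\ell\in[p]\setminus\{s,t\}} \langle X_i,X_\ell\rangle^2 \langle X_s,X_\ell\rangle\langle X_t,X_\ell\rangle$$
$$+ \sum_i \langle X_i,X_s\rangle^2 \langle X_s,X_s\rangle\langle X_t,X_s\rangle + \sum_i \langle X_i,X_t\rangle^2 \langle X_t,X_t\rangle\langle X_s,X_t\rangle$$

$$= \sum_{i\in[p]\setminus\{s,t\},\,\ell\in[p]\setminus\{s,t\}} \langle X_i,X_\ell\rangle^2 \langle X_s,X_\ell\rangle\langle X_t,X_\ell\rangle$$
$$+ \sum_i \langle X_i,X_s\rangle^2 \langle X_s,X_s\rangle\langle X_t,X_s\rangle + \sum_i \langle X_i,X_t\rangle^2 \langle X_t,X_t\rangle\langle X_s,X_t\rangle$$
$$+ \sum_{\ell\in[p]\setminus\{s,t\}} \langle X_s,X_\ell\rangle^3 \langle X_\ell,X_t\rangle + \sum_{\ell\in[p]\setminus\{s,t\}} \langle X_t,X_\ell\rangle^3 \langle X_\ell,X_s\rangle$$

By equation (P3), we have that

$$\sum_{\ell\in[p]\setminus\{s,t\}} \langle X_s,X_\ell\rangle^3 \langle X_\ell,X_t\rangle \le \tilde O(p n^{1.5})$$

Since $\langle X_s,X_t\rangle \le \tilde O(\sqrt{n})$ and $\sum_{i\in[p]} \langle X_i,X_s\rangle^2 = n^2 + \sum_{i\ne s} \langle X_i,X_s\rangle^2 \le \tilde O(np)$, we have that

$$\sum_i \langle X_i,X_s\rangle^2 \langle X_s,X_s\rangle\langle X_t,X_s\rangle \le \tilde O(p n^{2.5})$$

Invoking Lemma 7.11 with $u = X_s$ and $v = X_t$ fixed and viewing $X_\ell$, $\ell \in [p]\setminus\{s,t\}$, as the random variables $Z_i$'s, we have that with high probability,

$$\sum_{i\in[p]\setminus\{s,t\},\,\ell\in[p]\setminus\{s,t\}} \langle X_i,X_\ell\rangle^2 \langle X_s,X_\ell\rangle\langle X_t,X_\ell\rangle \le \tilde O(p^2 n^{1.5})$$

Hence combining the three equations above and taking a union bound over all choices of $s,t$, we obtain the desired result. □

Lemma 7.11. For $p \ge n$ and a sequence of good independent random variables $Z_1, \dots, Z_p$, and any two fixed vectors $u,v$ with $\|u\|_\infty \le \tilde O(1)$ and $\|v\|_\infty \le \tilde O(1)$, and $\langle u,v\rangle \le \tilde O(\sqrt{n})$, we have that with high probability,

$$\Big|\sum_{i\in[p]}\sum_{j\in[p]} \langle Z_i,Z_j\rangle^2 \langle Z_j,u\rangle\langle Z_j,v\rangle\Big| \le \tilde O(p^2 n^{1.5})$$

Proof. We first consider the cases with $i = j$ separately by expanding

$$\sum_{i\in[p]}\sum_{j\in[p]} \langle Z_i,Z_j\rangle^2 \langle Z_j,u\rangle\langle Z_j,v\rangle = \sum_{i\ne j} \langle Z_i,Z_j\rangle^2 \langle Z_j,u\rangle\langle Z_j,v\rangle + \sum_{i} \langle Z_i,Z_i\rangle^2 \langle Z_i,u\rangle\langle Z_i,v\rangle$$

$$= \sum_{i\ne j} \langle Z_i,Z_j\rangle^2 \langle Z_j,u\rangle\langle Z_j,v\rangle + \tilde O(p n^{2.5}) \qquad (7.1)$$

where the last line uses Lemma 7.9. Let $Y_1, \dots, Y_p$ be independent random variables that have the same distribution as $Z_1, \dots, Z_p$, respectively. Then by Theorem 8.7 we can decouple the sum of functions of $Z_i, Z_j$ into a sum of functions of $Z_j$ and $Y_i$:

$$\Pr\Big[\Big|\sum_{i\ne j} \langle Z_i,Z_j\rangle^2 \langle Z_j,u\rangle\langle Z_j,v\rangle\Big| > t\Big] \le C \cdot \Pr\Big[\Big|\sum_{i\ne j} \langle Y_i,Z_j\rangle^2 \langle Z_j,u\rangle\langle Z_j,v\rangle\Big| > t/C\Big]$$

Now we can invoke Lemma 7.12, which deals with the RHS of the equation above, and obtain that with high probability

$$\Big|\sum_{i\ne j} \langle Y_i,Z_j\rangle^2 \langle Z_j,u\rangle\langle Z_j,v\rangle\Big| \le \tilde O(p^2 n^{1.5})$$

Therefore, with high probability,

$$\Big|\sum_{i\ne j} \langle Z_i,Z_j\rangle^2 \langle Z_j,u\rangle\langle Z_j,v\rangle\Big| \le \tilde O(p^2 n^{1.5})$$

Then combining with equation (7.1) we obtain the desired result. □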

Lemma 7.12. For $p \ge n$ and a sequence of good independent random variables $Z_1, \dots, Z_p$, let $Y_1, \dots, Y_p$ be independent random variables which have the same distribution as $Z_1, \dots, Z_p$, respectively. Then for any two fixed vectors $u,v$ with $\|u\|_\infty \le \tilde O(1)$ and $\|v\|_\infty \le \tilde O(1)$, and $\langle u,v\rangle \le \tilde O(\sqrt{n})$, with high probability,

$$\Big|\sum_{i\in[p]}\sum_{j\in[p]} \langle Y_i,Z_j\rangle^2 \langle Z_j,u\rangle\langle Z_j,v\rangle\Big| \le \tilde O(p^2 n^{1.5})$$

Proof. Let $B = \sum_{i\in[p]} Y_i Y_i^T$. By Lemma 7.13 we have that with high probability over the randomness of $Y$, $\|B\| \le O(p)$, and $\mathrm{tr}(B) = pn$. Moreover, by Lemma 7.9 we have that with high probability, $|u^T B v| \le \tilde O(p\sqrt{n})$. Note that these bounds only depend on the randomness of $Y$, and conditioning on all these bounds being true, we can still use the randomness of the $Z_i$'s for concentration. We invoke Lemma 7.14 and obtain that

$$\Big|\sum_{i,j\in[p]} \langle Y_i,Z_j\rangle^2 \langle Z_j,u\rangle\langle Z_j,v\rangle\Big| = \Big|\sum_{j=1}^{p} Z_j^T B Z_j \langle Z_j,u\rangle\langle Z_j,v\rangle\Big| \le \tilde O(p^2 n^{1.5})$$

□

Lemma 7.13. For $p \ge n$ and a sequence of good independent random variables $Z_1, \dots, Z_p$, we have that with high probability,

$$\Big\|\sum_{i\in[p]} Z_i Z_i^T\Big\| \le O(p)$$

Proof. We use the matrix Bernstein inequality. First of all, we have that $\mathbb{E}[Z_i Z_i^T] = I_{n\times n}$, and therefore $\mathbb{E}\big[\sum_{i\in[p]} Z_i Z_i^T\big] = p I_{n\times n}$. Moreover, we check the variance of $Z_i Z_i^T$:

$$\mathbb{E}[Z_i Z_i^T Z_i Z_i^T] = n\,\mathbb{E}[Z_i Z_i^T] = n I_{n\times n}$$

Finally we observe that $\|Z_i Z_i^T\| \le n$. Thus applying the matrix Bernstein inequality we obtain that with high probability,

$$\Big\|\sum_{i\in[p]} Z_i Z_i^T - p I_{n\times n}\Big\| \le \tilde O(\sqrt{np} + n) = \tilde O(\sqrt{np})$$


□ 
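The deviation bound at the end of the proof of Lemma 7.13 can be observed numerically. The sketch below is only an illustration: it assumes Gaussian entries before normalization, and the sizes $p = 4000$, $n = 40$ and the function name `covariance_deviation` are hypothetical choices. It reports the spectral norm of $\sum_i Z_i Z_i^T - pI$ divided by $\sqrt{np}$.

```python
import numpy as np

def covariance_deviation(Z):
    """Spectral norm of sum_i Z_i Z_i^T - p I, where Z_i are the p rows of Z."""
    p, n = Z.shape
    return float(np.linalg.norm(Z.T @ Z - p * np.eye(n), ord=2))

rng = np.random.default_rng(3)
p, n = 4000, 40                                   # illustrative sizes (assumption)
Y = rng.standard_normal((p, n))
Z = np.sqrt(n) * Y / np.linalg.norm(Y, axis=1, keepdims=True)   # rows of squared norm n
ratio = covariance_deviation(Z) / np.sqrt(n * p)
print(ratio)
```

If the bound holds with a modest constant, `ratio` should remain of constant order as $p$ and $n$ grow with $p \ge n$.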


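Lemma 7.14. For $p \ge n$ and a sequence of good independent random variables $Z_1, \dots, Z_p$, and for any fixed symmetric PSD matrix $B \in \mathbb{R}^{n\times n}$ with $\|B\| \le \tilde O(p)$, $\mathrm{tr}(B) \le 2pn$, and any two fixed vectors $u,v$ with $\|u\|_\infty \le \tilde O(1)$ and $\|v\|_\infty \le \tilde O(1)$, and $\langle u,v\rangle \le \tilde O(\sqrt{n})$, we have that with high probability over the randomness of the $Z_i$'s,

$$\Big|\sum_{i=1}^{p} Z_i^T B Z_i \langle Z_i,u\rangle\langle Z_i,v\rangle\Big| \le \tilde O(p^2 n^{1.5})$$

Proof. Let $W = x^T B x \langle x,u\rangle\langle x,v\rangle$, where $x$ is a random variable that satisfies the conclusion of Proposition 7.2. We first calculate the expectation of $W$,

$$\mathbb{E}[W] = \mathbb{E}\Big[\Big(\sum_i B_{ii} x_i^2 + \sum_{i\ne j} x_i x_j B_{ij}\Big)\Big(\sum_i u_i v_i x_i^2 + \sum_{i\ne j} x_i x_j u_i v_j\Big)\Big]$$

$$= (c_4 - c_{2,2})\sum_i B_{ii} u_i v_i + c_{2,2}\,\mathrm{tr}(B)\langle u,v\rangle + \mathbb{E}\Big[\sum_{i\ne j} B_{ij}(u_i v_j + u_j v_i) x_i^2 x_j^2\Big]$$

$$= (c_4 - 3c_{2,2})\sum_i B_{ii} u_i v_i + c_{2,2}\,\mathrm{tr}(B)\langle u,v\rangle + 2c_{2,2}\, u^T B v$$

Therefore by the fact that $\|u\|_\infty, \|v\|_\infty \le \tilde O(1)$, $\mathrm{tr}(B) \le 2pn$, and $|u^T B v| \le \|B\|\,\|u\|\,\|v\| \le \tilde O(pn)$, we obtain that $|\mathbb{E}[W]| \le \tilde O(p n^{1.5})$. Observe that $Z_i^T B Z_i \le \tilde O(pn)$ a.s. (with respect to the randomness of $Z_i$), and $\|\langle Z_i,u\rangle\langle Z_i,v\rangle\|_{\psi_1} \le \tilde O(n)$, therefore we have that $\|Z_i^T B Z_i \langle Z_i,u\rangle\langle Z_i,v\rangle\|_{\psi_1} \le \tilde O(pn^2)$. Using Theorem 8.4 we obtain that with high probability,

$$\Big|\sum_{i=1}^{p} Z_i^T B Z_i \langle Z_i,u\rangle\langle Z_i,v\rangle - \mathbb{E}\Big[\sum_{i=1}^{p} Z_i^T B Z_i \langle Z_i,u\rangle\langle Z_i,v\rangle\Big]\Big| \le \tilde O(n^2 p^{1.5})$$

Using the fact that $\mathbb{E}[Z_i^T B Z_i \langle Z_i,u\rangle\langle Z_i,v\rangle] \le \tilde O(p n^{1.5})$ we obtain that with high probability

$$\Big|\sum_{i=1}^{p} Z_i^T B Z_i \langle Z_i,u\rangle\langle Z_i,v\rangle\Big| \le \tilde O(p^2 n^{1.5})$$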


□ 


Lemma 7.15. Suppose $p \ge n$ and $X_1, \dots, X_p$ are good independent random variables, then with high probability,

$$\|XX^T\|_F^2 \ge (1 - o(1))\, p^2 n$$

Proof. We first fix $i$ and examine $\sum_{j\ne i} \langle X_j,X_i\rangle^2$. We have that $\mathbb{E}\big[\sum_{j\ne i} \langle X_j,X_i\rangle^2\big] = (p-1)\|X_i\|^2 = (p-1)n$. Moreover, $\|\langle X_j,X_i\rangle^2\|_{\psi_1} \le O(n)$ (where $X_j$ is viewed as random and $X_i$ is viewed as fixed). Therefore by Theorem 8.4 we obtain that with high probability over the randomness of the $X_j$'s ($j \ne i$), $\sum_{j\ne i} \langle X_j,X_i\rangle^2 = (p-1)n \pm \tilde O(n\sqrt{p}) = (1 \pm o(1))pn$. Therefore taking a union bound over all $i$ and taking the sum we obtain that

$$\|XX^T\|_F^2 \ge \sum_{i}\sum_{j\ne i} \langle X_j,X_i\rangle^2 \ge (1 - o(1))\, p^2 n$$


8 Toolbox 


This section contains a collection of known technical results which are useful in proving the concentration bounds of Section 7. We note that when the data matrix $X$ takes uniformly $\{\pm 1\}$ entries, then $X$ satisfies Proposition 7.2 without any normalization, and actually, due to the independence of the entries, it is much easier to prove that it satisfies Condition 4.1.

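Definition 8.1 (Orlicz norm $\|\cdot\|_{\psi_\alpha}$). For $1 \le \alpha < \infty$, let $\psi_\alpha(x) = \exp(x^\alpha) - 1$. For $0 < \alpha < 1$, let $\psi_\alpha(x) = \exp(x^\alpha) - 1$ for large enough $x \ge x_\alpha$, and let $\psi_\alpha$ be linear in $[0, x_\alpha]$. The Orlicz norm $\psi_\alpha$ of a random variable $X$ is defined as

$$\|X\|_{\psi_\alpha} = \inf\{c \in (0,\infty) \mid \mathbb{E}[\psi_\alpha(|X|/c)] \le 1\} \qquad (8.1)$$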
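To make Definition 8.1 concrete, the following sketch computes the $\psi_\alpha$ norm of a finite sample, treated as a uniform distribution over its entries, by bisection on $c$. It is a minimal illustration under stated assumptions: it covers only $\alpha \ge 1$, where $\psi_\alpha(x) = \exp(x^\alpha) - 1$ on all of $[0,\infty)$, and the function name `orlicz_norm` and the tolerance are hypothetical choices.

```python
import numpy as np

def orlicz_norm(samples, alpha=2.0, tol=1e-10):
    """Empirical psi_alpha Orlicz norm (alpha >= 1) of a sample, via bisection.

    Returns (up to tol) the smallest c with mean(exp((|x|/c)^alpha)) - 1 <= 1.
    """
    if alpha < 1:
        raise ValueError("this sketch only covers alpha >= 1")
    x = np.abs(np.asarray(samples, dtype=float))
    if not np.any(x):
        return 0.0

    def psi_mean(c):
        # Overflow to +inf is harmless here: it just means c is far too small.
        with np.errstate(over="ignore"):
            return np.mean(np.exp((x / c) ** alpha)) - 1.0

    lo, hi = 0.0, float(np.max(x))
    while psi_mean(hi) > 1.0:        # grow hi until the constraint holds
        hi *= 2.0
    while hi - lo > tol:             # psi_mean is decreasing in c
        mid = 0.5 * (lo + hi)
        if psi_mean(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return hi

rademacher_norm = orlicz_norm([1.0, -1.0, 1.0, -1.0], alpha=2.0)
print(rademacher_norm)
```

For a uniform $\pm 1$ variable the constraint reads $\exp(1/c^2) - 1 \le 1$, so the $\psi_2$ norm is $1/\sqrt{\ln 2} \approx 1.2011$, which the usage line above reproduces.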


Note that by definition $\psi_\alpha$ is convex and increasing. The following theorem of Ledoux and Talagrand is our main tool for proving concentration inequalities in Section 7.

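Theorem 8.2 (Theorem 6.21 of [LT11]). There exists a constant $K_\alpha$ depending on $\alpha$ such that for a sequence of independent mean zero random variables $X_1, \dots, X_n$ in $L_{\psi_\alpha}$, if $0 < \alpha \le 1$,

$$\Big\|\sum_i X_i\Big\|_{\psi_\alpha} \le K_\alpha\Big(\Big\|\sum_i X_i\Big\|_1 + \Big\|\max_i \|X_i\|\Big\|_{\psi_\alpha}\Big) \qquad (8.2)$$

and if $1 < \alpha \le 2$,

$$\Big\|\sum_i X_i\Big\|_{\psi_\alpha} \le K_\alpha\Big(\Big\|\sum_i X_i\Big\|_1 + \Big(\sum_i \|X_i\|_{\psi_\alpha}^{\beta}\Big)^{1/\beta}\Big) \qquad (8.3)$$

where $1/\alpha + 1/\beta = 1$.

The following convenient Lemma allows us to control the second part of the RHS of (8.2) easily.

Lemma 8.3 ([vdVW00]). There exists an absolute constant $c$ such that for any real valued random variables $X_1, \dots, X_n$, we have that

$$\Big\|\max_{1\le i\le n} |X_i|\Big\|_{\psi_\alpha} \le c\,\psi_\alpha^{-1}(n) \max_{1\le i\le n} \|X_i\|_{\psi_\alpha}$$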

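Using Lemma 8.3 and Theorem 8.2 we obtain straightforwardly the following theorem, which will be used many times for proving concentration bounds in this paper.

Theorem 8.4. For any $0 < \alpha \le 1$, there exists a constant $K_\alpha$ such that for a sequence of independent random variables $X_1, \dots, X_n$,

$$\Big\|\sum_i X_i - \mathbb{E}\Big[\sum_i X_i\Big]\Big\|_{\psi_\alpha} \le K_\alpha \sqrt{n}\log n \cdot \max_i \|X_i\|_{\psi_\alpha} \qquad (8.4)$$

which implies that with high probability over the randomness of the $X_i$'s,

$$\Big|\sum_i X_i - \mathbb{E}\Big[\sum_i X_i\Big]\Big| \le \tilde O\big(K_\alpha \sqrt{n} \cdot \max_i \|X_i\|_{\psi_\alpha}\big)$$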
i 


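The following two lemmas are used to bound the Orlicz norms of random variables.

Lemma 8.5. There exists a constant $D_\alpha$ depending on $\alpha$ such that, if two (possibly correlated) random variables $X, Y$ have Orlicz norms bounded by $\|X\|_{\psi_\alpha} \le a$ and $\|Y\|_{\psi_\alpha} \le b$, then

$$\|XY\|_{\psi_{\alpha/2}} \le D_\alpha\, ab$$

Proof. For any $x, y, a, b, \alpha > 0$,

$$\exp\big(|xy|^{\alpha/2}\big) - 1 \le \exp\Big(\tfrac12 |x|^\alpha + \tfrac12 |y|^\alpha\Big) - 1 \le \tfrac12\big((\exp |x|^\alpha - 1) + (\exp |y|^\alpha - 1)\big)$$

Moreover, note that by the definition of $\psi_\alpha$, there exist constants $C_\alpha$ and $C'_\alpha$ such that for $x \ge 0$, $C'_\alpha(\exp(x^\alpha) - 1) \ge \psi_\alpha(x) \ge C_\alpha(\exp(x^\alpha) - 1)$. Therefore there exists a constant $E_\alpha$ such that $\psi_{\alpha/2}(|xy|) \le \frac{E_\alpha}{2}\big(\psi_\alpha(|x|) + \psi_\alpha(|y|)\big)$. Also note that for any constant $c$, there exists a constant $c'$ such that $\psi_\alpha(x/c') \le \psi_\alpha(x)/c$. Therefore, choosing $D_\alpha$ such that $\psi_{\alpha/2}(x/D_\alpha) \le \psi_{\alpha/2}(x)/E_\alpha$ for all $x \ge 0$, we obtain that

$$\mathbb{E}\Big[\psi_{\alpha/2}\Big(\frac{|XY|}{ab D_\alpha}\Big)\Big] \le \mathbb{E}\Big[\psi_{\alpha/2}\Big(\frac{|XY|}{ab}\Big)\Big] \Big/ E_\alpha \le \tfrac12\big(\mathbb{E}[\psi_\alpha(|X|/a)] + \mathbb{E}[\psi_\alpha(|Y|/b)]\big) \le 1$$

□

Lemma 8.6. Suppose a random variable $X$ has $\psi_\alpha$-Orlicz norm $a$, then $X - \mathbb{E}[X]$ has $\psi_\alpha$-Orlicz norm at most $2a$.

Proof. First of all, since $\psi_\alpha$ is convex and increasing on $[0,\infty)$, we have that $\mathbb{E}[\psi_\alpha(|X|/a)] \ge \psi_\alpha(\mathbb{E}[|X|]/a) \ge \psi_\alpha(|\mathbb{E}[X]|/a)$. Then we have that

$$\mathbb{E}\Big[\psi_\alpha\Big(\frac{|X - \mathbb{E}[X]|}{2a}\Big)\Big] \le \mathbb{E}\Big[\tfrac12\psi_\alpha(|X|/a) + \tfrac12\psi_\alpha(|\mathbb{E}[X]|/a)\Big] \le \mathbb{E}[\psi_\alpha(|X|/a)] \le 1$$

where we used the convexity of $\psi_\alpha$ and the fact that $\mathbb{E}[\psi_\alpha(|X|/a)] \ge \psi_\alpha(|\mathbb{E}[X]|/a)$.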


□ 


The following theorem of [PMS95] is useful to decouple the randomness of a sum of correlated random variables into a form that is easier to control.

Theorem 8.7 (Special case of Theorem 1 of [PMS95]). Let $X_1, \dots, X_n, Y_1, \dots, Y_n$ be independent random variables on a measurable space over $S$, where $X_i$ and $Y_i$ have the same distribution for $i = 1, \dots, n$. Let $f_{ij}(\cdot,\cdot)$ be a family of functions taking $S \times S$ to a Banach space $(B, \|\cdot\|)$. Then there exists an absolute constant $C$ such that for all $n \ge 2$, $t > 0$,

$$\Pr\Big[\Big\|\sum_{i\ne j} f_{ij}(X_i, X_j)\Big\| > t\Big] \le C \cdot \Pr\Big[\Big\|\sum_{i\ne j} f_{ij}(X_i, Y_j)\Big\| > t/C\Big]$$

The following lemma provides a simple way to prove the PSDness of a matrix that has large 
value on the diagonal and small off-diagonal values. 


Lemma 8.8 (Consequence of Gershgorin Circle Theorem). Suppose a matrix T is of the form 

\A B~\ 

T = where A, D are square diagonal matrices, and C is of dimension n x m. Then T 

o u 

is PSD if there exists a > 0 such that the following holds: An > ^ Sje[n] IQjljV € \p\ and 

D n > «Eie[m] \ c ij\>Vj e [P]- 


Proof. Let vector $u = (a\mathbf{1}_m, a^{-1}\mathbf{1}_n)$ and $v = (a^{-1}\mathbf{1}_m, a\mathbf{1}_n)$, where $\mathbf{1}_n$ is the $n$-dimensional all 1's vector. Then $T$ can be written as $T = vv^T \odot (uu^T \odot T)$, where $\odot$ denotes the entry-wise product of two matrices (that is, $A \odot B$ is a matrix with entries $A_{ij}B_{ij}$). Using the Gershgorin Circle Theorem and the conditions of the Lemma we obtain that $uu^T \odot T$ is PSD and therefore $T$ is PSD. □
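The rescaling argument in this proof can be sketched in code. The example below is an illustration under its own conventions, which are assumptions rather than the lemma's exact statement: it takes a symmetric matrix $T$ and a strictly positive scaling vector $u$, forms $uu^T \odot T$, and checks diagonal dominance, which by the Gershgorin Circle Theorem is sufficient for $uu^T \odot T$ (and hence $T$) to be PSD. The function name `psd_by_scaled_gershgorin` and the $3\times 3$ example are hypothetical.

```python
import numpy as np

def psd_by_scaled_gershgorin(T, u, tol=1e-12):
    """Sufficient PSD test: diagonal dominance of (u u^T) * T for a positive vector u.

    If every diagonal entry of the rescaled symmetric matrix is at least the sum
    of absolute off-diagonal entries in its row, Gershgorin's theorem makes the
    rescaled matrix PSD, and T is PSD too because the two are congruent.
    """
    T = np.asarray(T, dtype=float)
    u = np.asarray(u, dtype=float)
    if not np.allclose(T, T.T) or np.any(u <= 0):
        raise ValueError("need a symmetric T and a strictly positive u")
    S = np.outer(u, u) * T                       # entrywise product of u u^T with T
    diag = np.diag(S)
    off_sums = np.sum(np.abs(S), axis=1) - np.abs(diag)
    return bool(np.all(diag + tol >= off_sums))

# Hypothetical example with diagonal blocks [0.6] and diag(4, 4) and off-diagonal block [1, 1].
T = np.array([[0.6, 1.0, 1.0],
              [1.0, 4.0, 0.0],
              [1.0, 0.0, 4.0]])
a = 2.0
unscaled_ok = psd_by_scaled_gershgorin(T, np.ones(3))                 # plain Gershgorin
scaled_ok = psd_by_scaled_gershgorin(T, np.array([a, 1 / a, 1 / a]))  # after rescaling
min_eig = float(np.linalg.eigvalsh(T).min())
print(unscaled_ok, scaled_ok, min_eig)
```

In this example plain diagonal dominance fails for the first row, while the rescaled matrix is diagonally dominant, which is the situation the scaling parameter $a$ is meant to handle.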


9 Conclusions and future directions 

In this paper we prove a lower bound on the number of samples required to solve the Sparse PCA
problem by degree-4 SoS algorithms. This extends the (spectral) degree-2 SoS lower bound for the
problem, establishing the quadratic gap from the number of samples required by the (inefficient)
information theoretic bound. It remains an interesting problem to extend our lower bounds to
higher degree SoS algorithms (or even better, show that with some constant degree, one can solve
the problem with fewer samples). One specific difficulty we encountered in trying to extend the
lower bound to higher degree was the polynomial constraint $x_i^3 = x_i$, capturing the discreteness of
the hidden sparse vector. The SoS formulation of the problem without this condition is interesting
as well, and a lower bound for it may be easier.

As mentioned, it is possible that the best way to prove strong SoS lower bounds for Sparse PCA
is via the reduction of Berthet and Rigollet [BR13a], namely by improving existing lower bounds
for the Planted Clique problem. However, we note that this approach is limited as well, as it seems
that Sparse PCA is significantly harder. Specifically, Planted Clique has a simple $O(\log n)$-degree
SoS algorithm (and thus a quasi-polynomial time optimal solution), whereas for Sparse PCA we
know of no better sample-optimal algorithm than one running in exponential $p^{\Omega(k)}$ time. It is thus
conceivable that one can even prove $\Omega(k)$-degree SoS lower bounds for this problem.

More generally, we believe that statistical and machine learning problems provide a new and
challenging setting for testing the power and limits of SoS algorithms. While we have fairly strong
techniques for proving optimal SoS lower bounds for combinatorial optimization problems, we lack 
similar ones for ML problems. In particular, many other problems besides Sparse PCA seem to 
exhibit the apparent trade-off between the number of samples required information theoretically 
versus via computationally efficient techniques, offering fertile ground for attempting SoS lower 
bounds establishing such trade-offs. 

Finally, it would be nice to see more reductions between problems of statistical and ML nature,
such as the one by [BR13a]. Efficient reductions have proved extremely powerful in computational
complexity theory and optimization, enabling the framework of complexity classes and complete 
problems. Creating such a framework within machine learning will hopefully expose structure on 
the relative difficulty of problems in this vast area, highlighting some problems as more central to 
attack, and enabling both new algorithms and new lower bounds. 

Acknowledgments: We would like to thank Sanjeev Arora, Boaz Barak, Philippe Rigollet and
David Steurer for helpful discussions throughout various stages of this work.